
The Practical Significance of Measurement Error in Pulmonary Function Testing Conducted in Research Settings.

Abstract

Conventional spirometry produces measurement error by using repeatability criteria (RC) to discard acceptable data and terminating tests early when RC are met. These practices also implicitly assume that there is no variation across maneuvers within each test. This has implications for air pollution regulations that rely on pulmonary function tests to determine adverse effects or set standards. We perform a Monte Carlo simulation of 20,902 tests of forced expiratory volume in 1 second (FEV1 ), each with eight maneuvers, for an individual with empirically obtained, plausibly normal pulmonary function. Default coefficients of variation for inter- and intratest variability (3% and 6%, respectively) are employed. Measurement error is defined as the difference between results from the conventional protocol and an unconstrained, eight-maneuver alternative. In the default model, average measurement error is shown to be ∼5%. The minimum difference necessary for statistical significance at p < 0.05 for a before/after comparison is shown to be 16%. Meanwhile, the U.S. Environmental Protection Agency has deemed single-digit percentage decrements in FEV1 sufficient to justify more stringent national ambient air quality standards. Sensitivity analysis reveals that results are insensitive to intertest variability but highly sensitive to intratest variability. Halving the latter to 3% reduces measurement error by 55%. Increasing it to 9% or 12% increases measurement error by 65% or 125%, respectively. Within-day FEV1 differences ≤5% among normal subjects are believed to be clinically insignificant. Therefore, many differences reported as statistically significant are likely to be artifactual. Reliable data are needed to estimate intratest variability for the general population, subpopulations of interest, and research samples. 
Sensitive subpopulations (e.g., chronic obstructive pulmonary disease or COPD patients, asthmatics, children) are likely to have higher intratest variability, making it more difficult to derive valid statistical inferences about differences observed after treatment or exposure.


Keywords: FEV1; information quality; intertest variability; intratest variability; measurement error

Year: 2019 PMID: 31158315 PMCID: PMC6851780 DOI: 10.1111/risa.13315

Source DB: PubMed Journal: Risk Anal ISSN: 0272-4332 Impact factor: 4.000

PULMONARY FUNCTION DATA AS USED IN FEDERAL AIR POLLUTION REGULATIONS

In its 2008 National Ambient Air Quality Standard (NAAQS) for ozone (O3), the U.S. Environmental Protection Agency (USEPA, 2008) relied on a chamber study of 30 healthy, exercising young adults in which a transient group mean decrement in forced expiratory volume in 1 second (FEV1) of 2.9% was observed after 6.6 hours exposure to 60 ppb O3 (Adams, 2006a). USEPA was particularly concerned that two subjects experienced transient FEV1 decrements >10% (Brown, 2007). Transient group FEV1 decrements ∼5% and transient individual FEV1 decrements >10% were deemed to be important for defining adverse effects and for setting national regulations. Prior to the 2015 O3 NAAQS, a new chamber study of 31 healthy young adults was performed in which statistically significant transient group FEV1 decrements were observed for at least one exposure duration at 70 ppb, 80 ppb, and 87 ppb, but not at 60 ppb. In percentage terms, the largest group mean decrements were 5% (70 ppb), 7% (80 ppb), and 11% (87 ppb); some subjects experienced decrements >10% (Schelegle, Morales, Walby, Marion, & Allen, 2009). Based on this study, USEPA concluded that “the results of controlled human exposure studies strongly support setting the level of a revised O3 standard no higher than 70 ppb” (USEPA, 2015, p. 65353). Thus, federal air pollution policy considers transient FEV1 decrements that exceed 10%, or are determined to be statistically significant regardless of magnitude, as convincing evidence of adverse health effects. It is therefore important to explore whether the test methods used to measure these decrements are appropriate and reliable for that purpose.

SPIROMETRIC PROTOCOLS

Spirometry in air pollution research settings follows protocols established by the American Thoracic Society (ATS) or the European Respiratory Society (ERS) for use in clinical settings (ATS, 1979, 1987, 1995; Miller, Hankinson et al., 2005; National Health and Nutrition Examination Survey [NHANES], 2008). A “maneuver” is performed when a subject inhales deeply and blows hard into a tube connected to a spirometer. Peak expiratory flow rate (PEF), FEV1, forced vital capacity (FVC), and other measurements are calculated. Maximum subject performance is desired but constrained by examiner skill (Enright, Beck, & Sherrill, 2004); subjects’ ability, posture, and cooperativeness; physical setting; season; time of day; and a host of other factors (ATS, 1979, 1987, 1995; Miller, Crapo et al., 2005; Redlich et al., 2014; Stocks, Kirkby, & Lum, 2014), including inherent uncertainty and variability. Tests may be quality graded A through F (Enright et al., 2004; Enright, Johnson, Connett, Voelker, & Buist, 1991; Enright, Skloot, Cox‐Ganser, Udasin, & Herbert, 2010). The ATS protocol requires that three to eight maneuvers be performed for each test. The average is unlikely to change appreciably after three maneuvers, but the maximum will increase until gains from practice (Enright, 2003) are outweighed by subject fatigue (Miller, Hankinson et al., 2005). For a specific patient in a clinical setting, this appears to be sufficient. For air pollution research, however, eight maneuvers may not be optimal; improvement has been shown after the eighth maneuver (Lehmann, Vollset, Nygaard, & Gulsvik, 2004), and up to 30 maneuvers may be required to obtain “best” performance in young children (Aurora et al., 2004). 
This article quantifies measurement error resulting from unreported or unmeasured within‐subject variation across tests (intertest variability) and, more importantly, within‐subject variation across maneuvers within a single test (called intratest or within‐test (Enright et al., 2011; Kainu, Lindqvist, Sarna, & Sovijärvi, 2008; Kainu, Lindqvist, Sarna, Lundbäck, & Sovijärvi, 2008), within‐occasion (Aurora et al., 2004; Beydon et al., 2007a), intrasession (Kainu, 2008; Kainu, Lindqvist, Sarna, & Sovijärvi, 2008; Kainu, Lindqvist, Sarna, Lundbäck et al., 2008), or intrameasurement repeatability (Beydon et al., 2007a)). Measurement error is shown to be within the bounds of what clinicians consider not biologically meaningful but greater than performance decrements that air pollution researchers report as statistically significant.

Repeatability Criteria and Early Test Termination

Some maneuvers are discarded due to technical deficiencies; only acceptable maneuvers meeting repeatability criteria (RC) are retained and potentially reported (Miller, Hankinson et al., 2005). The stated purpose of RC is to “improve confidence in the diagnostic discrimination of the test and the confidence in which changes in lung function may be interpreted by the physician” (Enright et al., 2004, p. 236). Maneuvers are deemed “repeatable” if the differences between the highest and second‐highest FVC and FEV1 are within the RC. If differences exceed the RC, up to eight maneuvers are performed until the RC are met. RC have been a feature of ATS/ERS protocols since 1979, though the choice of RC has changed. It was set at 0.10 L/sec in 1979 (ATS, 1979), retained there in 1987 after a review (ATS, 1987), widened to 0.20 L/sec in 1995 (ATS, 1995), and narrowed to 0.15 L/sec in 2005 (Miller, Hankinson et al., 2005). There are no objective standards for choosing the “right” RC. ATS/ERS guidance says within‐day differences in FEV1 ≤ 5%, week‐to‐week differences ≤11%, and year‐to‐year differences ≤15% for normal subjects should not be interpreted as clinically meaningful; greater differences apply to chronic obstructive pulmonary disease (or COPD) patients (within‐day ≤13%; week‐to‐week ≤20%) (Pellegrino, Viegi et al., 2005). Similar interpretative guidance has been published for occupational spirometry (American College of Occupational and Environmental Physicians, 2016; Redlich et al., 2014). How these thresholds were determined is not reported, and while it is plausible that they implicitly account for intertest variability, there is no basis for inferring that they also account for intratest variability (Hnizdo et al., 2007). Spirometric protocols also include provisions for early test termination once RC have been met (ATS, 1979, 1987, 1995; Miller, Hankinson et al., 2005). 
This practice results in the failure to collect readily available, acceptable data, and the retention of “maximum” values that often are not unconstrained test maxima. Even if three maneuvers are sufficient to produce a pair of values satisfying the RC, more maneuvers often produce additional such pairs, some of which may have maxima greater than the maximum of the initial qualifying pair that led to test termination. The failure to collect these data may have negligible effects in clinical practice but it is likely to distort air pollution research studies.

Test Failure Due to Lack of Repeatability

Some research subjects cannot perform spirometric tests that are acceptable (i.e., technically valid). Clinical guidance acknowledges these problems and admonishes that the inability to perform spirometry may itself be evidence of lung impairment (Pellegrino, Viegi et al., 2005). Preexisting respiratory conditions (e.g., asthma, COPD) increase the proportion of subjects who fail (Pellegrino, Decramer et al., 2005). Subjects may produce acceptable maneuvers but not be able to produce repeatable ones. A retrospective study of 18,000 spirometric tests conducted at the Mayo Clinic indicated that about 95% of patients could produce repeatable maneuvers under the 1995 criterion. The authors wanted the RC tightened so that only 90% would pass. They acknowledged but dismissed reduced performance observed among subjects who were short, female, or had worse baseline lung function (Enright et al., 2004). In a large sample of Norwegians, believed to be representative, 12.7% of females and 7.7% of males failed to meet the 1987 RC, and 6.8% of females and 7.1% of males failed under the less demanding 1995 RC (Langhammer, Johnsen, Gulsvik, Holmen, & Bjermer, 2001). Indeed, it was the disparate effect of the 1987 RC that led to its relaxation in 1995 (ATS, 1995). A comparison across 14 sites worldwide, each with approximately 600 adult subjects aged ≥40, following the tighter 2005 RC and using identical spirometers with centralized examiner training, had approximately 10% failure rates, but higher failure rates for older subjects (Enright et al., 2011). Similar age‐dependent failure rates in baseline performance have been observed elsewhere (Kainu, Lindqvist, Sarna, Lundbäck et al., 2008; Lehmann et al., 2004). Whether the test failure rate is 5% or 10% in a representative sample, however, the absence of data from these research subjects creates interpretative difficulties. 
It imparts a form of nonresponse bias for which no statistical adjustment offers a remedy.

Managing Missing Data Due to Failure to Satisfy RC

The same clinical guidance that calls for the application of RC with early test termination also recommends against discarding valid data (ATS, 1979, 1987, 1995; Miller, Hankinson et al., 2005). Other clinical guidance favors the collection of more rather than fewer data and the deletion of no data, and criticizes device manufacturers that do not store data from all maneuvers: “[I]t can be tempting to discard any apparently discordant results during data collection before having the chance to inspect them more carefully. This runs the risk of retaining data that are ‘reproducibly wrong’ while discarding physiologically valid results!” (Stocks et al., 2014, p. 173). ATS/ERS guidance is unclear concerning what the examiner is to do if a repeatable pair cannot be obtained. Examiners are advised that testing should end if “[a] total of eight tests [sic; should be “maneuvers”] have been performed (optional) or [t]he patient/subject cannot or should not continue” (Miller, Hankinson et al., 2005, p. 325). However, ATS/ERS also advises that “[n]o spirogram or test result should be rejected solely on the basis of its poor repeatability. The repeatability of results should be considered at the time of interpretation. The use of data from manoeuvres with poor repeatability or failure to meet the [end of test] requirements is left to the discretion of the interpreter” (Miller, Hankinson et al., 2005, p. 326). In air pollution research studies, the inherent ambiguity in this guideline may lead to inconsistent data collection and reporting. The examiner could (1) discard subjects who cannot produce a highest and second‐highest FVC and FEV1 within the bounds of the applicable repeatability criterion; (2) assign the highest values obtained irrespective of whether the repeatability criterion is met; or (3) exercise discretion in some other manner to choose which values to assign to the test. 
The effects of these alternative approaches to missing data are potentially very different, and they are generally not reproducible.

WITHIN‐PERSON TEST VARIABILITY

Variation in spirometry is expected due to age, sex, height, and other factors (ATS, 1979, 1987, 1991, 1995; Hnizdo, Glindmeyer, & Petsonk, 2010; Miller, Hankinson et al., 2005). For healthy adult never‐smokers, performance generally peaks in one's late 20s and declines at a rate of 20–30 mL/year (Hnizdo et al., 2010). Inter‐ and intratest variation can be represented by the coefficient of variation, CV_t,m, where t indexes tests and m indexes maneuvers within test t. This can be separated into the two components, CV_t (intertest) and CV_m (intratest). CV_t can be accounted for using default adjustments (Miller, Crapo et al., 2005) or statistical models (Redlich et al., 2014). It appears that CV_m is implicitly acknowledged nearly everywhere (ATS, 1979, 1987, 1995; Miller, Crapo et al., 2005; Miller, Hankinson et al., 2005) but explicitly accounted for nowhere. In practice, the results of multiple maneuvers across tests and maneuvers are summarized by fixed test values, thus implicitly assuming that CV_t,m (and its components CV_t and CV_m) equal zero.

Within‐Person Intertest Variability, CV_t

Within‐person differences observed across multiple, identically conducted tests are more likely to be meaningful than simple before/after comparisons. When only two tests are performed, “large variability necessitates relatively large changes to be confident that a significant change has in fact occurred” (Pellegrino, Viegi et al., 2005, p. 962). The 13% threshold below which within‐day differences are believed not to be significant for COPD patients (Pellegrino, Viegi et al., 2005) has been estimated to imply a specific value of CV_t, with lesser percentages (e.g., ≈3% and ≈4%) assumed but not verified to apply to normal subjects (Hnizdo et al., 2010). Values for CV_t have been obtained in population‐representative samples. For example, CV_t was estimated at 13% and 12% for men and women, respectively, in a large random sample of asymptomatic Norwegian never‐smokers aged ≥20, using the 1987 RC (0.10 L/sec). A separate examination of the 19 nurses and technicians who performed the tests yielded a sample mean CV_t = 4% (Langhammer et al., 2001). The reason for these differences is not explained, but might be due to greater homogeneity among nurses and technicians, their expertise in spirometry, or both. In a project intended to identify the index of lung function with the highest signal‐to‐noise ratio (i.e., the highest ratio of between‐ to within‐subject variance), researchers reported mean CV_t of 2.7% for FVC and 3.3% for FEV1. These estimates were within the range of values reported in previous studies (FVC: 1.8−4.9%; FEV1: 2.3−4.7%), but all were unrepresentative small samples, making generalizations inappropriate (Künzli, Ackermann‐Liebrich, Keller, Perruchoud, & Schindler, 1995). CV_t values for children also have been reported (Beydon et al., 2007a), but their relevance to adults is unclear, the methods used to obtain them are different, and a larger fraction of children tends to fail spirometric testing (Loeb et al., 2008).

Within‐Person Intratest Variability, CV_m

In every spirometric study we have reviewed, there appears to be an implicit assumption that CV_m = 0. Large‐scale epidemiological studies such as NHANES (2008, 2008, 2011) also implicitly assume CV_m = 0 because fixed values are reported for each test. CV_m was calculated for each subject in a randomized sample of 648 Finns aged 25−75 (M: 248, F: 355), 603 (93%) of whom met the 1995 RC (0.15 L/sec). Across the sample, mean CV_m for FEV1 was 1.4% (95% CI = 1.36−1.51). The distribution of subject‐specific values was not reported (Kainu, Lindqvist, Sarna, & Sovijärvi, 2008). The retrospective study of Mayo Clinic spirometry data reported mean CV_m for FEV1 ranging from 2.65% to 3.35% among males and from 1.9% to 4.1% among females. In both cases, estimated CV_m was downwardly biased by the apparent exclusion of subjects who could not meet the RC. Unacceptable and nonrepeatable maneuvers were more frequent among older subjects and those with diminished health status, and increased with smaller physical stature (Enright et al., 2004).

Estimating Intratest Variability in the U.S. Population from NHANES Data Exclusion Rates

The proportion of acceptable maneuvers in the U.S. population excluded due to the RC can be inferred from NHANES (2011). Table I reports the number of maneuvers performed across the sample. If a minimum of three maneuvers is performed, the number of maneuvers should be the same for first through third maneuver. For unexplained reasons, NHANES reports more third than second maneuvers and more second than first maneuvers. The numbers of fourth through eighth maneuvers indicate how many subjects did not meet the RC in maneuvers three to seven, respectively. Three maneuvers of valid data were obtained from approximately 7,200 subjects. However, the RC necessitated a fourth maneuver be performed for 5,035 subjects, 70% of the sample. From maneuver four to maneuver eight, the number of additional maneuvers required ranged from 57% to 71%.

Table I

Exclusion Rates in NHANES 2009–2010 Pulmonary Function Testing

Maneuver m	Maneuvers Performed (N_m)	Maneuvers Accepted (A_mi)^a	Maneuvers Excluded (E_mi)^b	Implied Exclusion Rate (%E_mi)^c
1	6,845^d	–	–	–
2	7,169^d	–	–	–
3	7,198	2,163	5,035	70%
4	5,035	1,848	3,187	63%
5	3,187	1,136	2,051	64%
6	2,051	687	1,164	57%
7	1,364	394	970	71%
8	970	968	2	0.2%
9	2^e	–	–	–

a A_mi = N_m – E_mi.

b E_mi = N_(m+1).

c %E_mi = E_mi ÷ N_m.

d Logically, N_1 should be at least as large as N_2, and N_2 at least as large as N_3. NHANES (2008) provides no explanation why N_1 < N_2 and N_2 < N_3.

e Ninth maneuvers are not documented.

Source: NHANES (2008); NHANES uses the ATS (1995) guidelines (three minimum maneuvers; RC = 0.15 L/sec).
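The implied exclusion rates in Table I are simple ratios of maneuvers excluded to maneuvers performed. The following is a minimal sketch that recomputes them from the counts exactly as printed in the table; the dictionary and function names are illustrative and do not come from NHANES or the article.

```python
# Counts copied from Table I as printed (maneuvers 3 through 8).
performed = {3: 7198, 4: 5035, 5: 3187, 6: 2051, 7: 1364, 8: 970}
excluded = {3: 5035, 4: 3187, 5: 2051, 6: 1164, 7: 970, 8: 2}


def implied_exclusion_rate(m):
    """Percentage of maneuvers at position m excluded (footnote c: E / N)."""
    return 100.0 * excluded[m] / performed[m]


# Rates for maneuvers 3-7, rounded to whole percentages as in the table.
rates = {m: round(implied_exclusion_rate(m)) for m in range(3, 8)}

if __name__ == "__main__":
    for m, rate in rates.items():
        print(f"maneuver {m}: {rate}% excluded")
```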

Table II shows the data exclusion fractions obtained using the simulation tool described in Section 3 for RC = 0.15 L/sec and CV_m values ranging from 1% to 10%. The CV_m value that best fits the NHANES data exclusion fractions is about 6%. The NHANES collection was a probability sample, so 6% appears to approximate the average CV_m for the U.S. population.

Table II

Proportion of Acceptable Results Excluded Due to Repeatability Criterion as a Function of Intratest Variability CV

CV_m	Proportion of Acceptable Results Excluded
1%	0.3%
2%	14%
3%	32%
4%	46%
5%	55%
6%	61%
7%	67%
8%	71%
9%	76%
10%	77%


SIMULATION

To gain insight concerning the effects of RC, CV_t, and CV_m, a Monte Carlo analysis was performed for a single subject with defined age, height, and normal pulmonary function. The simulation model assumes that each test, and each maneuver within each test, is statistically independent and earned an A grade. Relaxing these assumptions would only increase inter‐ and intratest variability and strengthen the results reported in Section 3.3.

Default Model

Table III shows default simulation model parameters. Normal pulmonary function was obtained from a specific reference equation (Brändli, Schindler, Künzli, Keller, & Perruchoud, 1996). A CV_t of 3% was obtained from Künzli et al. (2000), and CV_m is assumed to be 6% based on the population value derived from NHANES (2008). To ensure high statistical power (90%) and a low nominal a priori Type I error rate (5%), 20,902 tests were performed (Robey & Barcikowski, 1992, Table I). For each independent test, eight independent maneuvers were performed using each test's simulated mean FEV1 and an intratest standard deviation of 0.21 (i.e., 6% of the predicted 3.55 L/sec).

Table III

Default Simulation Parameters

Parameters	Common to Both Models
Predicted max FEV₁ (Brändli et al., 1996)	exp(−8.240 + 1.9095 ln(H) − 0.0037 A − 0.000033 A²) = 3.55 L/sec, where H (height) = 173 cm and A (age) = 59 years.
Estimated SD FEV₁ (Brändli et al., 1996)	0.51
Tests	20,902
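As a check on the default parameters, the reference equation in Table III can be evaluated directly, and the default coefficients of variation converted to standard deviations. This is a minimal sketch; the function and variable names are illustrative, and the conversion assumes that each standard deviation is simply the coefficient of variation multiplied by the predicted FEV1.

```python
import math


def predicted_fev1(height_cm, age_years):
    """Reference equation from Table III (Brändli et al., 1996), in L/sec."""
    log_fev1 = (
        -8.240
        + 1.9095 * math.log(height_cm)
        - 0.0037 * age_years
        - 0.000033 * age_years ** 2
    )
    return math.exp(log_fev1)


# Default subject from Table III: height 173 cm, age 59 years.
fev1 = predicted_fev1(173, 59)

# Assumption: SD = CV x predicted FEV1 for both variance components.
sigma_t = 0.03 * fev1  # intertest SD from the 3% default
sigma_m = 0.06 * fev1  # intratest SD from the 6% default

if __name__ == "__main__":
    print(f"predicted FEV1 = {fev1:.2f} L/sec")
    print(f"intertest SD   = {sigma_t:.2f} L/sec")
    print(f"intratest SD   = {sigma_m:.2f} L/sec")
```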

This procedure assures that intratest variability is accounted for without affecting the simulated value obtained for each test. Maneuvers are not in fact independent, though it is unclear how to model their dependence. Subjects’ performance may improve during early maneuvers due to learning and decline in later maneuvers due to fatigue. Thus, if maximum performance is the desired goal, there is an optimal number of maneuvers. But this optimum is unknown, and it is likely to vary across subjects and over time within subjects.

To replicate the ATS/ERS protocol, the first three simulated maneuvers from each test were examined to determine whether the highest and second‐highest FEV1 differed by ≤0.15 L/sec. If such a pair was found, the highest value was deemed the maximum, it was recorded as the fixed representation for that test, and the test was presumed to have been terminated. If no qualifying pair was found, the fourth maneuver was compared to the maximum of the first three maneuvers. If the difference between the fourth maneuver and the maximum of the first three maneuvers was ≤0.15 L/sec, the greater of the two values was deemed to be the test maximum and the test was terminated. This procedure was conducted iteratively for up to eight maneuvers to obtain the ATS/ERS protocol maximum.

In the alternative model, tests were not terminated when the RC were met. The highest value across all eight maneuvers was deemed to be the maximum for each test. The difference between the unrestricted maximum for each test and the deemed ATS/ERS maximum equals the magnitude of measurement error for each test.
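The procedure described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the authors' simulation tool: the function names, the random seed, and the use of normally distributed draws are assumptions, and tests that never produce a repeatable pair are simply counted and dropped. No attempt is made to reproduce the article's reported values.

```python
import random
import statistics

PREDICTED_FEV1 = 3.55  # L/sec, default subject (Table III)
N_MANEUVERS = 8


def protocol_max(maneuvers, rc=0.15, min_maneuvers=3):
    """Return (deemed maximum, maneuvers used), or (None, n) if never repeatable."""
    first = sorted(maneuvers[:min_maneuvers], reverse=True)
    if first[0] - first[1] <= rc:
        return first[0], min_maneuvers
    running_max = first[0]
    # Compare each further maneuver with the maximum of all earlier ones.
    for k in range(min_maneuvers, len(maneuvers)):
        new = maneuvers[k]
        if abs(new - running_max) <= rc:
            return max(new, running_max), k + 1
        running_max = max(running_max, new)
    return None, len(maneuvers)


def simulate(n_tests=20_902, cv_t=0.03, cv_m=0.06, rc=0.15, seed=0):
    """Return (mean measurement error in L/sec, fraction of tests discarded)."""
    rng = random.Random(seed)
    sigma_t = cv_t * PREDICTED_FEV1  # assumed: SD = CV x predicted FEV1
    sigma_m = cv_m * PREDICTED_FEV1
    errors, discarded = [], 0
    for _ in range(n_tests):
        test_mean = rng.gauss(PREDICTED_FEV1, sigma_t)
        maneuvers = [rng.gauss(test_mean, sigma_m) for _ in range(N_MANEUVERS)]
        deemed, _ = protocol_max(maneuvers, rc)
        if deemed is None:
            discarded += 1  # no repeatable pair within eight maneuvers
            continue
        # Unrestricted eight-maneuver maximum minus the protocol maximum.
        errors.append(max(maneuvers) - deemed)
    return statistics.mean(errors), discarded / n_tests


if __name__ == "__main__":
    mean_error, frac_discarded = simulate()
    print(f"mean measurement error: {mean_error:.3f} L/sec")
    print(f"tests discarded: {frac_discarded:.1%}")
```

Because the unrestricted maximum can never be smaller than the value at which testing stopped, every per-test error in this sketch is nonnegative, which is why early termination can only bias the recorded maximum downward.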

Managing Reproducibility Failure

Discarding valid results that do not satisfy the RC reduces the apparent intratest variability by excising the tails of the distribution. The more stringent the RC, the larger will be the tails excised and the degree to which intratest variability is understated. How much understatement occurs depends on the maneuver at which the RC are met and testing terminates. This is shown in Fig. 1. If only three maneuvers are performed, the share of simulated tests that yield no repeatable pair is about 20%, 50%, and 65% at three successively higher values of CV_m. Unless CV_m is very low, even the full complement of eight maneuvers may not be enough to produce sufficient data to ensure that the highest feasible maximum is obtained and intratest variability is not materially understated.

Figure 1

Percentage of simulated tests rejected for lack of ATS/ERS maneuver repeatability, by number of maneuvers performed.

As noted in Section 2.3, a choice must be made with respect to the management of simulated tests that fail to produce repeatable pairs and thus cannot be modeled using a strict application of the ATS/ERS protocol. We interpreted the ATS/ERS protocol to require these tests be discarded (option 1 in Section 2.3). Expressed another way, ATS/ERS assumes that subjects who produce maneuvers satisfying the RC are no different from subjects who cannot, an assumption that is likely to be incorrect and to artifactually reduce the significance of CV_m. Our approach likely departs from typical practice in chamber studies and observational epidemiology. In no study in either genre that we have examined have we found subjects excluded for failure to satisfy RC. This means researchers employed options 2 or 3 from Section 2.3, and embedded measurement error may be impossible to estimate. At the default CV_m, there is a 15% probability that no repeatable pair will be obtained even after eight maneuvers (see Fig. 1). Thus, 15 of every 100 subjects are expected to be excluded under the ATS/ERS protocol due solely to lack of repeatability even if all eight maneuvers are performed. Further, much larger fractions will be excluded if researchers conduct only three to four maneuvers. This practice, which is clearly undesirable, nevertheless may be necessary in a research design requiring hourly tests (Adams, 2006a, 2006b; Schelegle et al., 2009). The burden of performing eight maneuvers during the last 10 minutes of each hour may be greater than even young, healthy, athletic subjects can tolerate.
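The dependence of test rejection on the number of maneuvers and on intratest variability can be explored with a short simulation. The sketch below is an assumption-laden illustration rather than the authors' tool: the function names and seed are hypothetical, draws are assumed normal and independent, and the stopping rule follows the iterative description in Section 3. Because repeatability depends only on differences between maneuvers within a test, the test mean is held fixed.

```python
import random

PREDICTED_FEV1 = 3.55  # L/sec, default subject (Table III)


def first_repeatable_at(maneuvers, rc=0.15):
    """Maneuver count at which testing would stop, or None if it never would."""
    top, second = sorted(maneuvers[:3], reverse=True)[:2]
    if top - second <= rc:
        return 3
    running_max = top
    for k in range(3, len(maneuvers)):
        if abs(maneuvers[k] - running_max) <= rc:
            return k + 1
        running_max = max(running_max, maneuvers[k])
    return None


def failure_curve(cv_m, n_tests=5000, n_maneuvers=8, rc=0.15, seed=1):
    """Fraction of tests with no repeatable pair after k maneuvers, k = 3..8."""
    rng = random.Random(seed)
    sigma_m = cv_m * PREDICTED_FEV1  # assumed: SD = CV x predicted FEV1
    stops = []
    for _ in range(n_tests):
        maneuvers = [rng.gauss(PREDICTED_FEV1, sigma_m) for _ in range(n_maneuvers)]
        stops.append(first_repeatable_at(maneuvers, rc))
    curve = {}
    for k in range(3, n_maneuvers + 1):
        failed = sum(1 for s in stops if s is None or s > k)
        curve[k] = failed / n_tests
    return curve


if __name__ == "__main__":
    for cv in (0.03, 0.06, 0.09):
        curve = failure_curve(cv)
        summary = ", ".join(f"{k}: {v:.0%}" for k, v in curve.items())
        print(f"CV_m = {cv:.0%} -> {summary}")
```

By construction the failure fraction cannot rise as more maneuvers are allowed, so stopping at three or four maneuvers always rejects at least as many tests as running all eight.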

Default Model Results

We compare results from the ATS/ERS protocol terminated after three maneuvers with an unrestricted eight‐maneuver model. This comparison maximizes the magnitude of measurement error, but it appears to most closely approximate actual practice in research settings. Fig. 2 compares the cumulative probability density functions for FEV1 under the ATS/ERS protocol and the unrestricted eight‐maneuver alternative for RC values ranging from 0.1 to 0.2 L/sec. The horizontal difference is measurement error at each point. It is visually apparent that the cause of measurement error is less the choice of RC than the practice of terminating testing once any RC are satisfied.

Figure 2

Proportion of FEV1 tests without a pair of maneuvers satisfying ATS/ERS reproducibility criterion (RC) for three alternative intratest coefficients of variation (CV).

Note: Predicted FEV1 = 3.55 L/sec; CV_t = 3%; CV_m = 6%; RC = 0.1−0.2 L/sec.

Measurement error can be characterized in L/sec or percentage of baseline. This is shown in Figs. 3(a) (L/sec) and (b) (percentage) for the range of RC values considered. Mean measurement error ranges from about 0.15 L/sec to about 0.20 L/sec, with the difference rising across the simulated distribution. In percentage terms, however, measurement error can exceed 7% and is never less than 3%.

Figure 3

Mean FEV1 measurement error resulting from test termination after three maneuvers compared to unconstrained maximum (L/sec and %).

Note: Predicted FEV1 = 3.55 L/sec; CV_t = 6%; CV_m = 6%; RC = 0.1−0.2 L/sec.

These magnitudes are neither small nor unimportant in air pollution research. They are the same as or greater than within‐day differences for normal subjects that are interpreted by clinicians as not meaningful (Pellegrino, Viegi et al., 2005). They are also considerably greater than reported differences in spirometric performance attributable to test‐subject posture (0.04−0.07 L/sec), which ATS occupational guidance deems a confounding factor large enough to warrant preventive control (Redlich et al., 2014).

Minimum Differences Required to Infer that FEV1 Pairs Come from Different Distributions

Conventional practice treats each test result as fixed (i.e., the intertest standard deviation, σ_t, equals zero), so all differences across tests, no matter their magnitude, are treated as presumptively meaningful. Accounting for inter‐ and intratest variability requires differences to be examined statistically. For example, taking only intertest variability into account, the default model assumes σ_t = 0.11 L/sec (derived from CV_t = 3%). Fig. 4 shows that any pair of FEV1 values must differ by ≥0.4 L/sec (11%) to infer at p ≤ 0.10 that they come from different distributions (e.g., before and after exposure). When both inter‐ and intratest variability are accounted for, this difference must be ≥0.6 L/sec (16%). This is shown in Fig. 5, which juxtaposes on the same scale the pre‐ and postexposure distributions necessary for (1) the postexposure FEV1 to be below the 10th percentile of the preexposure distribution, and (2) preexposure FEV1 to be above the 90th percentile of the postexposure distribution. The gap between these FEV1 values (0.6 L/sec) is the minimum difference between pre‐ and postexposure mean FEV1 for differences to be statistically significant at p < 0.10. (Stipulating that FEV1 is reasonably expected to decline after exposure, this is equivalent to a one‐tailed test at p < 0.05.)

Figure 4

Minimum decline in FEV1 necessary for statistical significance at p ≤ 0.05 taking only CV_t into account.

Figure 5

Minimum decline in FEV1 necessary for statistical significance at p ≤ 0.05 taking both CV_t and CV_m into account.

Sensitivity Analysis

The simulation model allows results to be calculated using alternative values for subject characteristics such as sex, height, and age; the ATS/ERS protocol attribute RC; and the measures of within‐person inter‐ and intratest variability, CV_t and CV_m. Subject characteristics matter because the RC, a constant, is a rising fraction of FEV1 for persons with lower FEV1 due to age or short stature. For these persons, a larger fraction of valid maneuvers will fail to satisfy the RC. However, the higher rejection rate is counteracted by the ATS/ERS requirement to conduct additional maneuvers, which, ceteris paribus, results in higher maximum test values. If researchers strictly follow the ATS/ERS guidelines and collect up to eight maneuvers, the disproportionate effect of the fixed RC on subjects whose normal pulmonary function is below average will be attenuated. However, they will still have a substantial fraction of subjects for whom there is no acceptable pair of maneuvers, as shown in Table II, and no objective path to resolution. Sensitivity analysis of intertest variability shows that it has a minimal effect regardless of model. However, differences in intratest variability have substantial effects. These differences are summarized in Table IV across the range of ATS/ERS RC values for the default CV_m (6%) and two alternatives on either side (0% and 3%; 9% and 12%). Halving the default CV_m reduces measurement error by about 55%. Increasing the default by half increases measurement error by about 65%, and doubling the default increases measurement error by about 125%. CV_m = 0% corresponds to the ATS/ERS protocol, which by assuming no intratest variability implies no measurement error.

Table IV

Mean FEV1 Measurement Error Under ATS/ERS Protocol After Three Maneuvers, by Repeatability Criterion (RC) and Intratest Coefficients of Variation () (L/sec and %)

	Intratest Coefficient of Variation (CVm)
Repeatability Criterion (RC)	0%a	3%	6%	9%	12%
0.10 L/sec	0.00 L/sec 0.0%	0.08 L/sec 2.1%	0.18 L/sec 4.9%	0.30 L/sec 8.1%	0.41 L/sec 11%
0.15 L/sec	0.00 L/sec 0.0%	0.08 L/sec 2.1%	0.19 L/sec 5.1%	0.31 L/sec 8.4%	0.42 L/sec 11%
0.20 L/sec	0.00 L/sec 0.0%	0.07 L/sec 2.0%	0.17 L/sec 4.7%	0.28 L/sec 7.6%	0.40 L/sec 11%

Notes: Default subject characteristics from Table III. Intertest coefficient of variation = 3%. L/sec values reported ± 0.005 L/sec. Percentage values reported to two significant figures. Interquartile range: 25th−75th percentile of simulated distribution.

a Because 0% yields an undefined result, 0.01% is used to approximate the 0% value implicitly assumed in the ATS/ERS protocol and published studies.


DISCUSSION

RC discard some signals as if they were noise, and early test termination prevents the collection of potentially important signals. When inter‐ and intratest variability are assumed not to exist, all calculated pulmonary function changes are implicitly assumed to be real, not test protocol artifacts. This leads to unsupportable inferences about the statistical significance of observed differences. Measurement error alone could easily exceed the differences observed. Additional problems arise if tests are conducted only before and after exposure because intertest variability will not be accounted for. Guidelines recommend against drawing inferences from just two tests: “It is more likely that a real change has occurred when more than two measurements [i.e., tests] are performed over time” (Pellegrino, Viegi et al., 2005, p. 961, emphasis added). For the default comparison, any pair of test values must differ by more than 0.57 L/sec (16%) to be able to infer at p ≤ 0.05 that they are not drawn from the same distribution. A reasonable rule of thumb may be to refrain from interpreting as statistically meaningful any observed difference less than this amount unless and until inter‐ and intratest variability have been taken into account, both in data collection and statistical analysis.
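A back-of-envelope check of a threshold of this size can be sketched as follows. This is not necessarily the authors' calculation. It assumes that a single test value carries independent inter‐ and intratest components whose variances add, that a before/after difference is the difference of two such independent values, and that the comparison is a one-sided normal test for a decline. The function name and the one-sided default are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_change(cv_inter=0.03, cv_intra=0.06, alpha=0.05, one_sided=True):
    """Smallest before/after difference, as a fraction of mean FEV1,
    distinguishable from noise under the stated assumptions."""
    # Assumption: independent inter- and intratest components, so variances add.
    cv_single = sqrt(cv_inter ** 2 + cv_intra ** 2)
    # The difference of two independent values has sqrt(2) times the single-value SD.
    cv_diff = sqrt(2) * cv_single
    # Critical value of the standard normal for the chosen tail probability.
    tail = alpha if one_sided else alpha / 2
    z = NormalDist().inv_cdf(1 - tail)
    return z * cv_diff


print(f"{min_detectable_change():.1%}")  # prints 15.6%
```

With the default coefficients of variation (3% and 6%), this simplified calculation gives a threshold close to the 16% reported in the text. A two-sided test under the same assumptions gives a larger threshold.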

Strengths and Limitations

Our analysis has several key strengths. First, we rely on widely accepted, peer‐reviewed studies of normal pulmonary function for all model parameters except the intratest coefficient of variation (CVm), for which the available literature is limited. Second, we infer a default value for intratest variability from NHANES, the “gold standard” for empirical data about the U.S. population. This inference is based on an examination of NHANES’ data exclusion rates, recognizing the similarity between the NHANES and ATS/ERS protocols. Third, our Monte Carlo model imposes no additional assumptions besides normality across and within tests for a single person. These assumptions can be modified to conduct unlimited sensitivity analyses.

Our analysis has many of the same limitations that affect most research in this field. Spirometry has other sources of inter‐ and intratest variability and potential bias, few of which typically are adequately controlled. Inter‐ and intratest variability can arise from technician quality (all technicians cannot be above average, much less superior), differences in spirometric devices (precision and accuracy vary), data entry, subject–device interactions, test settings, seasonal and diurnal effects, time periods between tests, and confounding effects. Indeed, the ATS/ERS protocols are commendable for including numerous elements intended to reduce the influence of confounders (ATS, 1979, 1987, 1995; Beydon et al., 2007a; Miller, Hankinson et al., 2005; Stocks et al., 2014).

Our simulation model has a related limitation insofar as it does not account for improvements in subject performance across maneuvers due to learning or decrements in subject performance across maneuvers due to fatigue. We are unable to capture these effects because no data appear to be publicly available. Both are affected by coaching, the quality of which is variable and difficult to measure. It is intuitively reasonable to expect that there is an optimal number of maneuvers at which the gains from practice equal the losses from fatigue. But the optimum would vary across subjects, coaches, and other factors that cannot be easily modeled.

Intratest variability poses additional challenges. It may vary across subjects due to a host of factors. The period between maneuvers (not just tests) may matter, and the optimal spacing of maneuvers is both unknown and likely to vary across subjects. Finally, biological instability may arise between maneuvers insofar as testing induces rapid changes in lung volume that affect airway properties (Beydon et al., 2007a).

Our results assume that within‐person FEV1 is approximately normally distributed across both simulated independent tests and simulated independent maneuvers within each test. Results would differ with other distributional forms. We are aware of no empirical evidence supporting normality or any alternative distributional form. The assumption of normality across maneuvers could be tested, and possibly refuted, with better intratest data collection, but we are aware of no way to inform the choice of the intratest distribution theoretically. At this stage of knowledge, it is more important to be transparent about the choice of distribution and cognizant of its potential significance. The effects of that choice cannot be quantified, however, as the number of alternative assumptions is infinitely large.

Practical Recommendations

The failure to account for intratest variability is a material limitation of conventional spirometry in research settings. There appears to have been no systematic effort to collect sufficient data to estimate intratest variability, whether for the population, research samples, or subpopulations of interest. All spirometric protocols recognize that intratest variability is important; hence, the universal guidance to conduct multiple maneuvers. But this recognition is abandoned in practice by terminating tests early, thus failing to collect needed data, and discarding all but a single fixed value to represent each test. The result is measurement error and bias. Measurement error has pernicious effects on research intended to make causal inferences about small changes after treatment or exposure. A constructive path forward is to collect enough maneuver data to estimate intratest variability (CVm) for the general population (e.g., NHANES), subpopulations presumed to be at greater risk (e.g., COPD patients, asthmatics, children), and any convenience sample (e.g., chamber study volunteers). Wherever possible, sample‐specific CVm should be estimated and tested against these reference values to ensure that inferences about the significance of observed changes are statistically valid.

30 in total

1. Interpretative strategies for lung function tests.

Authors: R Pellegrino; G Viegi; V Brusasco; R O Crapo; F Burgos; R Casaburi; A Coates; C P M van der Grinten; P Gustafsson; J Hankinson; R Jensen; D C Johnson; N MacIntyre; R McKay; M R Miller; D Navajas; O F Pedersen; J Wanger
Journal: Eur Respir J Date: 2005-11 Impact factor: 16.671

2. Comparison of chamber 6.6-h exposures to 0.04-0.08 PPM ozone via square-wave and triangular profiles on pulmonary responses.

Authors: William C Adams
Journal: Inhal Toxicol Date: 2006-02 Impact factor: 2.724

3. Lung function testing: selection of reference values and interpretative strategies. American Thoracic Society.

Authors:
Journal: Am Rev Respir Dis Date: 1991-11

4. How to avoid misinterpreting lung function tests in children: a few practical tips. (Review)

Authors: Janet Stocks; Jane Kirkby; Sooky Lum
Journal: Paediatr Respir Rev Date: 2014-02-13 Impact factor: 2.726

5. Quality of spirometry tests performed by 9893 adults in 14 countries: the BOLD Study.

Authors: P Enright; W M Vollmer; B Lamprecht; R Jensen; A Jithoo; W Tan; M Studnicka; P Burney; S Gillespie; A S Buist
Journal: Respir Med Date: 2011-05-06 Impact factor: 3.415

6. Quality control of spirometry: a lesson from the BRONCUS trial.

Authors: R Pellegrino; M Decramer; C P O van Schayck; P N R Dekhuijzen; T Troosters; C van Herwaarden; D Olivieri; M Del Donno; W De Backer; I Lankhorst; A Ardia
Journal: Eur Respir J Date: 2005-12 Impact factor: 16.671

7. 6.6-hour inhalation of ozone concentrations from 60 to 87 parts per billion in healthy humans.

Authors: Edward S Schelegle; Christopher A Morales; William F Walby; Susan Marion; Roblee P Allen
Journal: Am J Respir Crit Care Med Date: 2009-05-15 Impact factor: 21.405

8. Acceptability and repeatability of spirometry in children using updated ATS/ERS criteria.

Authors: Jeffrey S Loeb; Walter C Blower; Julie F Feldstein; Beth A Koch; Asia L Munlin; William D Hardie
Journal: Pediatr Pulmonol Date: 2008-10

9. How to make sure your spirometry tests are of good quality. (Review)

Authors: Paul L Enright
Journal: Respir Care Date: 2003-08 Impact factor: 2.258

10. FEV1 response to bronchodilation in an adult urban population.

Authors: Annette Kainu; Ari Lindqvist; Seppo Sarna; Bo Lundbäck; Anssi Sovijärvi
Journal: Chest Date: 2008-04-10 Impact factor: 9.410

2 in total

1. Quantitative CT Analysis in Patients with Pulmonary Emphysema: Do Calculated Differences Between Full Inspiration and Expiration Correlate with Lung Function?

Authors: Lan Song; Jonas A Leppig; Ralf H Hubner; Bianca C Lassen-Schmidt; Konrad Neumann; Dorothea C Theilig; Felix W Feldhaus; Ute L Fahlenkamp; Bernd Hamm; Wei Song; Zhengyu Jin; Felix Doellinger
Journal: Int J Chron Obstruct Pulmon Dis Date: 2020-08-03

2. A comparison of alternative selection methods for reporting spirometric parameters in healthy adults.

Authors: Jennifer H Therkorn; Daniella R Toto; Michael J Falvo
Journal: Sci Rep Date: 2021-07-22 Impact factor: 4.996
