Selective cutoff reporting in studies of the accuracy of the Patient Health Questionnaire-9 and Edinburgh Postnatal Depression Scale: Comparison of results based on published cutoffs versus all cutoffs using individual participant data meta-analysis.

Dipika Neupane, Brooke Levis, Parash M Bhandari, Brett D Thombs, Andrea Benedetti.

Abstract

OBJECTIVES: Selectively reported results from only well-performing cutoffs in diagnostic accuracy studies may bias estimates in meta-analyses. We investigated cutoff reporting patterns for the Patient Health Questionnaire-9 (PHQ-9; standard cutoff 10) and Edinburgh Postnatal Depression Scale (EPDS; no standard cutoff, commonly used 10-13) and compared accuracy estimates based on published cutoffs versus all cutoffs.
METHODS: We conducted bivariate random effects meta-analyses using individual participant data to compare accuracy from published versus all cutoffs.
RESULTS: For the PHQ-9 (30 studies, N = 11,773), published results underestimated sensitivity for cutoffs below 10 (median difference: -0.06) and overestimated it for cutoffs above 10 (median difference: 0.07). For the EPDS (19 studies, N = 3637), sensitivity estimates from published results were similar for cutoffs below 10 (median difference: 0.00) but higher for cutoffs above 13 (median difference: 0.14). Specificity estimates from published and all cutoffs were similar for both tools. The mean of all reported cutoffs in PHQ-9 studies with optimal cutoffs below 10 was 8.8, compared to 11.8 for those with optimal cutoffs above 10. The mean for EPDS studies with optimal cutoffs below 10 was 9.9, compared to 11.8 for those with optimal cutoffs greater than 10.
CONCLUSION: Selective cutoff reporting was more pronounced for the PHQ-9 than EPDS.

Keywords: diagnostic test accuracy; individual participant data meta-analysis; meta-analysis; publication bias; selective cutoff reporting

Year: 2021 PMID: 33978306 PMCID: PMC8412225 DOI: 10.1002/mpr.1873

Source DB: PubMed Journal: Int J Methods Psychiatr Res ISSN: 1049-8931 Impact factor: 4.182

INTRODUCTION

Selective reporting occurs when authors make decisions regarding publication of study results based on whether or not outcomes are favorable (Kirkham et al., 2010). In accuracy studies of ordinal or continuous tests, selective cutoff reporting occurs when results are published for one or more cutoffs that maximize sensitivity and specificity in a particular study but not for other relevant cutoffs (Levis et al., 2017; Moriarty et al., 2015). Selective cutoff reporting can lead to overestimation of diagnostic accuracy in primary studies and in meta‐analyses that synthesize results from primary studies with selectively reported results (Leeflang et al., 2008). Only one previous study has investigated selective cutoff reporting patterns in test accuracy studies (Levis et al., 2017). That study obtained individual participant data (IPD) from 13 primary studies included in a published meta‐analysis (Manea et al., 2012) of the accuracy of the Patient Health Questionnaire‐9 (PHQ‐9) depression screening tool. Results from two sets of meta‐analyses were compared. In the first set, the result at each cutoff was based only on those studies that published results at that cutoff. In the second set, which was based on the IPD, the result at each cutoff was calculated from all available studies, regardless of which cutoffs were originally published. Sensitivity estimates differed substantially between published and IPD datasets for cutoffs lower and higher than the standard cutoff of 10 (meaning cutoff ≥10) but were similar at the standard cutoff. This was because most studies published results for the standard cutoff, but authors tended to publish results from cutoffs lower or higher than 10 depending on whether the PHQ‐9 was relatively poorly sensitive but specific (lower cutoffs published) or highly sensitive but poorly specific (higher cutoffs published) in their dataset.
A cutoff of 10 is used as the standard cutoff for screening for major depression with the PHQ‐9 (Gilbody et al., 2007; Kroenke et al., 2001; Kroenke & Spitzer, 2002; Spitzer et al., 1999; Wittkampf et al., 2007) and maximizes combined sensitivity and specificity (Levis et al., 2019), but standard cutoffs are less well‐defined for other depression screening tools. Studies of the Edinburgh Postnatal Depression Scale (EPDS), the most commonly used screening tool among women in pregnancy and postpartum (Hewitt et al., 2009; Howard et al., 2014), typically consider cutoffs between 10 and 13 as standard, with 13 being most commonly used (Hewitt et al., 2009; O'Connor et al., 2016). A recent IPD meta‐analysis (IPDMA) found that cutoff 11 maximizes combined sensitivity and specificity (Levis et al., 2020). The degree to which there is an agreed upon standard cutoff for a screening tool may influence selective cutoff reporting. Thus, this study aimed to compare selective cutoff reporting in screening tools with and without a well‐defined standard cutoff. We evaluated selective cutoff reporting with a substantially larger set of PHQ‐9 studies than was used in the previous study (Levis et al., 2017) and compared results to the EPDS, which does not have a well‐defined standard cutoff. Specific objectives were to use IPDMA with the PHQ‐9 and EPDS, separately, to (1) compare sensitivity and specificity based on all cutoffs from all primary studies versus data from only cutoffs for which accuracy estimates were published in the primary studies; and (2) explore cutoff reporting patterns with reference to the identified optimal cutoff in each primary study.

METHODS

We analyzed data accrued for IPDMAs on PHQ‐9 and EPDS diagnostic accuracy (PROSPERO CRD42014010673, CRD42015024785), and protocols were published for each IPDMA (Thombs et al., 2014, 2015). The protocol for the present study, which was not part of the main IPDMA protocols, was published separately (https://osf.io/vw3bz/). The protocol described only the EPDS analysis, and we subsequently added the PHQ‐9 to be able to compare screening tools with and without well‐defined standard cutoffs. As this study involved only analysis of previously collected de‐identified data and because all included studies were required to have obtained ethics approval and informed consent, the Research Ethics Committee of the Jewish General Hospital determined that ethics approval was not required.

Study eligibility

Datasets from articles in any language were eligible for the main IPDMAs if (1) they used the PHQ‐9 or EPDS; (2) they included diagnostic classification for current Major Depressive Disorder (MDD) or Major Depressive Episode (MDE) using Diagnostic and Statistical Manual of Mental Disorders (DSM) or International Classification of Diseases (ICD) criteria based on a validated diagnostic interview; (3) the interview and PHQ‐9 or EPDS were administered within 2 weeks of each other; (4) participants were ≥18 years and not recruited from school‐based settings (PHQ‐9) or ≥18 years and pregnant or within 12 months postpartum (EPDS); and (5) participants were not recruited from psychiatric settings or because they had symptoms of depression, since screening is done to identify previously unrecognized cases. Datasets where not all participants were eligible were included if primary data allowed selection of eligible participants. Many primary studies in the main IPDMA databases that contributed eligible datasets never published estimates of screening accuracy. Thus, for the present study, we restricted analyses to primary studies with publications that included sensitivity and specificity estimates for at least one PHQ‐9 or EPDS cutoff for identifying major depression. We excluded studies if the sample size from the published primary study differed by >10% from the sample included in our IPDMA datasets. Sample sizes from original primary studies and the IPDMA databases differed in some cases because, for instance, we excluded participants who were included in the original studies if there were >2 weeks between their index test and reference standard administrations or if they were <18 years old. We also excluded primary studies with publications that reported accuracy results only for diagnostic classifications broader than major depression (e.g., “any depressive disorder”) if the number of cases in the published article and IPDMA datasets differed by >10%.

Search strategy and study selection

A medical librarian searched Medline, Medline In‐Process & Other Non‐Indexed Citations and PsycINFO via OvidSP, and Web of Science via ISI Web of Knowledge from January 1, 2000 to February 7, 2015 (Method S1a) for the PHQ‐9 and from inception to June 10, 2016 (Method S1b) for the EPDS, using peer‐reviewed search strategies (McGowan et al., 2016). We also reviewed reference lists of relevant reviews and queried contributing authors about non‐published studies. Search results were uploaded into RefWorks (RefWorks‐COS) for de‐duplication and then into DistillerSR (Evidence Partners). Two investigators independently reviewed titles and abstracts. If either deemed a study potentially eligible, full‐text review was done by two investigators, independently, with disagreements resolved by consensus, consulting a third investigator when necessary. Translators were consulted for languages other than those for which team members were fluent.

Data contribution, extraction, and synthesis

Authors of eligible datasets were emailed invitations to contribute de‐identified primary data at least three times, as necessary. If there was no response, we emailed co‐authors and attempted phone contact. For each study, we compared published results with results from raw datasets and resolved any discrepancies in consultation with primary study investigators. For defining major depression, we considered MDD or MDE based on DSM or ICD. If more than one was reported, we prioritized MDE over MDD and DSM over ICD. For studies with multiple time points, we included data from only the time point with the most participants. To facilitate comparison between published results and IPDMA results, we applied sampling weights in the IPDMA only when accuracy results reported in the original published study were calculated using weights. We determined whether included primary studies cited the Standards for Reporting of Diagnostic Test Accuracy (STARD) guideline in the publication or not (Bossuyt et al., 2003).

Statistical analyses

We replicated the statistical analyses used in the previous study of selective cutoff reporting with the PHQ‐9 (Levis et al., 2017). We estimated sensitivity and specificity for cutoffs up to 5 points below and above the cutoffs used as standard (PHQ‐9 cutoff 10, range 5–15; EPDS cutoffs 10–13, range 5–18). We compared meta‐analysis results using only cutoffs for which accuracy estimates were published in the primary studies (the published dataset) with results using data from all cutoffs from all studies (the full dataset). For both sets of meta‐analyses, for each cutoff, bivariate random‐effects models were estimated via Gauss‐Hermite quadrature (Riley et al., 2008). This approach models sensitivity and specificity simultaneously, accounting for the inherent correlation between them and the precision of estimates within studies.
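
The per-cutoff inputs to such an analysis can be illustrated with a minimal sketch that computes sensitivity and specificity at every cutoff from participant-level data. It covers only the 2×2 step for a single dataset, not the bivariate random-effects pooling, and it assumes a participant screens positive when the score is at or above the cutoff. The function name and the toy data are hypothetical and are not taken from the study.

```python
from typing import Dict, Iterable, Sequence, Tuple


def accuracy_by_cutoff(
    scores: Sequence[int],
    has_md: Sequence[bool],
    cutoffs: Iterable[int],
) -> Dict[int, Tuple[float, float]]:
    """Return {cutoff: (sensitivity, specificity)} for one dataset.

    Assumption: a participant screens positive when score >= cutoff.
    """
    n_cases = sum(1 for d in has_md if d)
    n_noncases = len(has_md) - n_cases
    if n_cases == 0 or n_noncases == 0:
        raise ValueError("need at least one case and one non-case")
    results = {}
    for cutoff in cutoffs:
        true_pos = sum(1 for s, d in zip(scores, has_md) if d and s >= cutoff)
        true_neg = sum(1 for s, d in zip(scores, has_md) if not d and s < cutoff)
        results[cutoff] = (true_pos / n_cases, true_neg / n_noncases)
    return results


# Hypothetical toy data: ten participants with a score and a diagnosis.
toy_scores = [3, 7, 9, 10, 12, 15, 4, 8, 11, 18]
toy_has_md = [False, False, True, True, False, True, False, False, True, True]

# PHQ-9 range evaluated in the text: cutoffs 5 through 15.
toy_result = accuracy_by_cutoff(toy_scores, toy_has_md, range(5, 16))
for cutoff, (sens, spec) in toy_result.items():
    print(f"cutoff {cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Because every cutoff is evaluated on the same participants, sensitivity in this sketch can only stay the same or fall as the cutoff rises.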

Differences in sensitivity and specificity estimates using published versus full datasets

To examine differences in results produced by meta‐analyses based on the published and full datasets, we constructed separate pooled receiver operating characteristic (ROC) curves. In addition, 95% confidence intervals for the differences in sensitivity and specificity at each cutoff were constructed via bootstrap (Van der Leeden et al., 1997, 2008), resampling at the study and subject levels with 1000 iterations for each cutoff. We calculated the median absolute difference in estimated sensitivity and specificity across evaluated cutoffs.
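
A two-level bootstrap of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a simple pooled proportion stands in for the bivariate random-effects estimate, the interval uses the plain percentile method, and iterations that yield no estimate are skipped. All names and the toy data are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

Participant = Tuple[int, bool]  # (screening score, has major depression)
Study = List[Participant]


def pooled_sensitivity(studies: Sequence[Study], cutoff: int) -> Optional[float]:
    """Stand-in estimator: proportion of all cases with score >= cutoff."""
    case_scores = [score for study in studies for score, has_md in study if has_md]
    if not case_scores:
        return None
    return sum(score >= cutoff for score in case_scores) / len(case_scores)


def bootstrap_difference_ci(
    studies: Sequence[Study],
    published: Sequence[bool],  # did study i publish results at this cutoff?
    cutoff: int,
    n_iter: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile 95% CI for (published minus full) sensitivity."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_iter):
        full_sample, published_sample = [], []
        for _ in studies:
            # Level 1: draw a study with replacement.
            i = rng.randrange(len(studies))
            # Level 2: draw that study's participants with replacement.
            resampled = [rng.choice(studies[i]) for _ in studies[i]]
            full_sample.append(resampled)
            if published[i]:
                published_sample.append(resampled)
        full_est = pooled_sensitivity(full_sample, cutoff)
        pub_est = pooled_sensitivity(published_sample, cutoff)
        if full_est is None or pub_est is None:
            continue  # iteration produced no difference estimate
        diffs.append(pub_est - full_est)
    if not diffs:
        raise ValueError("no bootstrap iteration produced a difference estimate")
    diffs.sort()
    lower = diffs[int(0.025 * (len(diffs) - 1))]
    upper = diffs[int(0.975 * (len(diffs) - 1))]
    return lower, upper


# Hypothetical toy data: three small studies, two of which published cutoff 10.
TOY_STUDIES = [
    [(12, True), (8, True), (4, False), (11, False), (15, True)],
    [(9, True), (14, True), (3, False), (6, False)],
    [(7, True), (10, True), (13, True), (5, False)],
]
TOY_PUBLISHED = [True, False, True]

print(bootstrap_difference_ci(TOY_STUDIES, TOY_PUBLISHED, cutoff=10))
```

Drawing the published and full samples from the same resampled studies keeps the two estimates paired within each iteration, so the interval describes their difference rather than two unrelated estimates.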

Reporting patterns

We assessed whether primary studies tended to preferentially report low or high cutoffs depending on the study's sample‐specific optimal cutoff. For each primary study, we identified the optimal cutoff that the authors explicitly described as optimal or using a similar term. If the authors did not identify an optimal cutoff, we used the cutoff that maximized Youden's J (sensitivity + specificity−1) (Youden, 1950). For each study, we plotted the optimal cutoff, along with all other cutoffs for which results were published. We noted whether the reported cutoffs tended to be low or high compared to the standard cutoff (PHQ‐9 10) or set of commonly used cutoffs (EPDS 10–13). For studies with optimal cutoffs below and above the standard or commonly used cutoffs, separately, we calculated the mean of the cutoffs reported.
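
As a minimal sketch of the Youden's J rule, the snippet below picks the cutoff with the largest J from a table of accuracy estimates. The input values are the pooled full-dataset PHQ‐9 estimates from Table 1 and serve purely as illustration; in the study, J was applied to each primary study's published accuracies. The function names are hypothetical, and breaking ties in favor of the lowest cutoff is an assumption of this sketch.

```python
from typing import Dict, Tuple

# Pooled full-dataset PHQ-9 estimates from Table 1:
# cutoff -> (sensitivity, specificity). Used here only as example input.
PHQ9_FULL: Dict[int, Tuple[float, float]] = {
    5: (0.97, 0.54),
    6: (0.96, 0.62),
    7: (0.94, 0.69),
    8: (0.92, 0.75),
    9: (0.87, 0.80),
    10: (0.83, 0.85),
    11: (0.76, 0.88),
    12: (0.69, 0.91),
    13: (0.60, 0.93),
    14: (0.54, 0.95),
    15: (0.47, 0.96),
}


def youden_j(sensitivity: float, specificity: float) -> float:
    """Youden's J = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1.0


def optimal_cutoff(accuracy: Dict[int, Tuple[float, float]]) -> int:
    """Cutoff with the largest J; ties go to the lowest cutoff (assumption)."""
    return max(sorted(accuracy), key=lambda c: youden_j(*accuracy[c]))


best = optimal_cutoff(PHQ9_FULL)
print(best, round(youden_j(*PHQ9_FULL[best]), 2))  # prints: 10 0.68
```

With these inputs the rule selects cutoff 10, which matches the statement that the standard PHQ‐9 cutoff maximizes combined sensitivity and specificity.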

RESULTS

Identification of eligible studies

Patient Health Questionnaire‐9

Of 58 studies included in the main IPDMA (Levis et al., 2019), 28 were excluded from the present study because they did not publish diagnostic accuracy results for any PHQ‐9 cutoffs or because the number of participants or major depression cases in the IPD dataset differed by >10% from the published studies or could not be determined (Figure S1a; Tables S1a and S2a). The final dataset included 30 studies (N total: 11,773; N major depression: 1587 [13%]; Table S3a) that compared the PHQ‐9 with a validated diagnostic interview (Mini Neuropsychiatric Diagnostic Interview, Structured Clinical Interview for DSM Disorders, Composite International Diagnostic Interview, Clinical Interview Schedule Revised, Schedules for Clinical Assessment in Neuropsychiatry, or Computerized Diagnostic Interview Schedule). Of the 30 included studies, 7 reported only a single cutoff and 23 reported more than one cutoff. Of the 23 with multiple cutoffs reported, 18 identified an optimal cutoff in the published study; of those, 16 (89%) were described as based on Youden's J (N: 8) or did not have an explanation but matched the Youden's J optimum calculated from published cutoffs (N: 8). Among the 30 studies, only two cited the STARD reporting guideline (Arroll et al., 2010; Sherina et al., 2012).

Edinburgh Postnatal Depression Scale

Of 49 studies in the original IPDMA dataset (Levis et al., 2020), 30 were not eligible and were excluded from the present study (Figure S1b; Tables S1b and S2b). Thus, 19 unique studies (N total: 3637; N major depression: 531 [15%]) were included (Table S3b), which compared the EPDS with a validated diagnostic interview (Mini Neuropsychiatric Diagnostic Interview, Structured Clinical Interview for DSM Disorders, Clinical Interview Schedule, or Diagnostic Interview for Genetic Studies). Of the 14 studies that reported more than one cutoff, 13 identified an optimal cutoff; of those, 10 (77%) were based on Youden's J (N: 2) or did not have an explanation but matched what would have been obtained using Youden's J calculated from published cutoffs (N: 8). None of the studies cited STARD.

Differences in sensitivity and specificity estimates based on published versus full datasets

Table 1 shows sensitivity and specificity for the PHQ‐9 and EPDS at each cutoff for the published and full datasets with the ROC plots in Figures 1 and 2.

TABLE 1

Comparison of accuracy results from IPDMA of PHQ‐9 and EPDS with the published dataset only versus the full dataset

PHQ‐9
Published dataset: columns 2–8. Full dataset (30 studies; N = 11,773; MD cases = 1587): columns 9–12.
Cutoff	No. of studies	No. of participants	No. of MD cases	Published sensitivity	95% CI	Published specificity	95% CI	Full sensitivity	95% CI	Full specificity	95% CI
5	5	1663	367	0.91	0.86, 0.94	0.68	0.55, 0.79	0.97	0.94, 0.98	0.54	0.48, 0.60
6	6	2193	377	0.87	0.77, 0.93	0.72	0.61, 0.82	0.96	0.92, 0.97	0.62	0.56, 0.68
7	6	2050	438	0.87	0.75, 0.93	0.72	0.60, 0.81	0.94	0.90, 0.97	0.69	0.63, 0.74
8	12	5798	720	0.87	0.78, 0.92	0.77	0.70, 0.82	0.92	0.87, 0.95	0.75	0.70, 0.79
9	14	5283	766	0.85	0.76, 0.91	0.81	0.75, 0.85	0.87	0.81, 0.91	0.80	0.76, 0.84
10	26	10,593	1378	0.82	0.74, 0.88	0.86	0.83, 0.89	0.83	0.76, 0.88	0.85	0.81, 0.88
11	15	5292	767	0.83	0.72, 0.91	0.88	0.83, 0.92	0.76	0.69, 0.82	0.88	0.85, 0.91
12	16	6188	832	0.73	0.63, 0.81	0.91	0.87, 0.94	0.69	0.62, 0.75	0.91	0.88, 0.93
13	9	2104	455	0.70	0.59, 0.79	0.95	0.87, 0.98	0.60	0.54, 0.67	0.93	0.91, 0.95
14	5	1231	277	0.63	0.47, 0.76	0.96	0.89, 0.99	0.54	0.47, 0.61	0.95	0.93, 0.96
15	6	3546	374	0.47	0.37, 0.59	0.97	0.97, 0.98	0.47	0.40, 0.54	0.96	0.95, 0.97

Abbreviations: CI, Confidence Interval; EPDS, Edinburgh Postnatal Depression Scale; IPDMA, Individual Participant Data Meta‐analysis; MD, Major Depression.

For these cutoffs, one sample proportion test with continuity correction was used to estimate sensitivity and specificity and confidence intervals.
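
As a quick consistency check on Table 1, the sketch below looks for places where pooled PHQ‐9 sensitivity rises as the cutoff rises. When every study contributes data at every cutoff, a higher cutoff can only reclassify screen-positive cases as negative, so sensitivity should not increase. An increase is therefore a sign that different subsets of studies contributed at neighboring cutoffs. The helper name is hypothetical; the values are copied from Table 1.

```python
from typing import List, Sequence, Tuple

CUTOFFS = list(range(5, 16))

# Pooled PHQ-9 sensitivity estimates copied from Table 1 (cutoffs 5 to 15).
PUBLISHED_SENS = [0.91, 0.87, 0.87, 0.87, 0.85, 0.82, 0.83, 0.73, 0.70, 0.63, 0.47]
FULL_SENS = [0.97, 0.96, 0.94, 0.92, 0.87, 0.83, 0.76, 0.69, 0.60, 0.54, 0.47]


def sensitivity_increases(
    cutoffs: Sequence[int], sens: Sequence[float]
) -> List[Tuple[int, int]]:
    """Return (lower cutoff, higher cutoff) pairs where sensitivity rises."""
    pairs = list(zip(cutoffs, sens))
    return [
        (c_lo, c_hi)
        for (c_lo, s_lo), (c_hi, s_hi) in zip(pairs, pairs[1:])
        if s_hi > s_lo
    ]


print(sensitivity_increases(CUTOFFS, PUBLISHED_SENS))  # [(10, 11)]
print(sensitivity_increases(CUTOFFS, FULL_SENS))       # []
```

The only increase occurs in the published dataset, between cutoffs 10 and 11 (0.82 to 0.83); the full dataset decreases at every step.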

FIGURE 1

Receiver operating characteristic (ROC) curves for the diagnostic accuracy of the Patient Health Questionnaire‐9 (PHQ‐9). The points on the ROC curves indicate each of the PHQ‐9 cutoffs from 5 (right) to 15 (left).

FIGURE 2

Receiver operating characteristic (ROC) curves for the diagnostic accuracy of the Edinburgh Postnatal Depression Scale (EPDS). The points on the ROC curves indicate each of the EPDS cutoffs from 5 (right) to 18 (left).

For the PHQ‐9, the difference in estimated sensitivity (published minus full dataset) ranged from −0.09 to 0.10 (median absolute difference: 0.06; Table 2). For cutoffs below 10, estimated sensitivity was lower for the published dataset (−0.02 to −0.09; median: −0.06) with 95% CIs including zero but inclining more towards negative, whereas estimated specificity was higher (0.01 to 0.14; median: 0.03) with 95% CIs including zero. For the standard cutoff 10, the differences in sensitivity and specificity were −0.01 (95% CI: −0.05, 0.01) and 0.01 (95% CI: 0.00, 0.04), respectively. For cutoffs above 10, estimated sensitivity was higher for the published dataset (0.00 to 0.10; median: 0.07) with 95% CIs including zero but inclining more towards positive, and estimated specificity was similar (0.00 to 0.02; median: 0.01) with 95% CIs including zero.

TABLE 2

Differences in estimated sensitivity and specificity using the published dataset only versus the full dataset for PHQ‐9 and EPDS

PHQ‐9
Columns 2–3: % of participants included in published results for each cutoff. Columns 4–7: differences in estimates using the published dataset versus the full dataset (published minus full).
Cutoff	% participants	% MD cases	Sensitivity: estimated difference	Sensitivity: bootstrap 95% CI	Specificity: estimated difference	Specificity: bootstrap 95% CI
5	14	23	−0.06	−0.13, 0.00	0.14	0.02, 0.26
6	19	24	−0.09	−0.18, −0.01	0.10	0.00, 0.20
7	17	28	−0.07	−0.20, 0.00	0.03	−0.09, 0.15
8	49	45	−0.05	−0.14, 0.02	0.02	−0.03, 0.08
9	45	48	−0.02	−0.11, 0.05	0.01	−0.04, 0.05
10	90	87	−0.01	−0.05, 0.01	0.01	0.00, 0.04
11	45	48	0.07	0.00, 0.13	0.00	−0.03, 0.03
12	53	52	0.04	−0.03, 0.09	0.00	−0.02, 0.03
13	18	29	0.10	−0.02, 0.20	0.02	−0.04, 0.05
14	10	17	0.09	−0.07, 0.23	0.01	−0.04, 0.04
15	30	24	0.00	−0.12, 0.13	0.01	0.00, 0.03

Note: For PHQ‐9, 15 iterations (1.5%) that did not produce difference estimates were removed prior to determining the bootstrap CI.

For EPDS, 284 iterations (28.4%) for cutoffs 5‐6, 32 iterations (3.2%) for cutoffs 7‐15 and 275 iterations (27.5%) for cutoff 16 that did not produce difference estimates were removed prior to determining bootstrap CIs. Only 1 study published EPDS cutoffs 17 and 18, so only participant level resampling was done for published dataset.

Abbreviations: CI, Confidence Interval; EPDS, Edinburgh Postnatal Depression Scale; PHQ‐9, Patient Health Questionnaire‐9.
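
The PHQ‐9 summary medians quoted in the text can be re-derived from the sensitivity column of Table 2. The short sketch below does this with the standard library; the variable names are illustrative, and the values are copied from the table rows above.

```python
from statistics import median

# PHQ-9 sensitivity differences (published minus full) copied from Table 2.
SENS_DIFF = {
    5: -0.06, 6: -0.09, 7: -0.07, 8: -0.05, 9: -0.02,
    10: -0.01,
    11: 0.07, 12: 0.04, 13: 0.10, 14: 0.09, 15: 0.00,
}
STANDARD_CUTOFF = 10

below = [d for c, d in SENS_DIFF.items() if c < STANDARD_CUTOFF]
above = [d for c, d in SENS_DIFF.items() if c > STANDARD_CUTOFF]

# Median absolute difference across all evaluated cutoffs.
median_abs_all = median(abs(d) for d in SENS_DIFF.values())
# Signed medians for cutoffs below and above the standard cutoff.
median_below = median(below)
median_above = median(above)

print(median_abs_all, median_below, median_above)  # 0.06 -0.06 0.07
```

The signed medians have opposite signs on either side of cutoff 10, which is the under- and overestimation pattern described for the PHQ‐9.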

For the EPDS, the difference in estimated sensitivity (published minus full dataset) ranged from −0.02 to 0.20 (median: 0.03), with all 95% CIs including zero (Table 2). For cutoffs below 10, estimated sensitivity (−0.02 to 0.01; median: 0.00) and estimated specificity (−0.01 to 0.02; median: 0.01) were similar for the published and full datasets. For cutoffs of 10 to 13, estimated sensitivity differed by 0.02 to 0.03 (median: 0.03), and estimated specificity differed by −0.02 to 0.00 (median: −0.02). For cutoffs above 13, estimated sensitivity was higher for the published dataset (0.08 to 0.20; median: 0.14), and estimated specificity was similar or lower (−0.08 to 0.00; median: 0.00).

Reporting patterns

Figure 3 shows the pattern of reporting with respect to optimal cutoffs for included PHQ‐9 studies; 9 studies had optimal cutoffs below 10, 14 equal to 10, 6 greater than 10 and 1 study had optimal cutoffs of both 10 and 12. Studies for which the PHQ‐9 was poorly sensitive at the cutoff 10 (sensitivity: 0.27–0.74) (Arroll et al., 2010; Inagaki et al., 2013; Lambert et al., 2015; Lotrakul et al., 2008; Pence et al., 2012; Thombs et al., 2008; Stafford et al., 2007; Sung et al., 2013; Turner et al., 2012) had optimal cutoffs that were below 10. These studies tended to report more cutoffs below 10 than above 10 (mean of reported cutoffs: 8.8). Studies for which the PHQ‐9 was highly sensitive at cutoff 10 (sensitivity: 0.85–1.00) (Bombardier et al., 2012; Delgadillo et al., 2011; Fann et al., 2005; Khamseh et al., 2011; Lowe et al., 2004; Twist et al., 2013) had optimal cutoffs that were greater than 10. These studies tended to report more cutoffs above 10 than below 10 (mean of reported cutoffs: 11.8).

FIGURE 3

Pattern of cutoff reporting for PHQ‐9 studies. Cells shaded in gray represent cutoff points for which diagnostic accuracy results are reported in the primary studies. “O” represents the optimal cutoff for PHQ‐9 explicitly stated in the studies except for Inagaki et al. (2013), Pence et al. (2012), Arroll (2010), Cholera (2014), Amoozegar (2017), which did not identify an optimal cutoff. For those, Youden's J optimal was calculated from published accuracies. For Gjerdingen (2009) and Vöhringer (2013), only one cutoff was reported without stating whether it was optimal or not. van Steenbergen‐Weijenburg 2010 reported 10 and 12 as optimal cutoffs. Studies that reported accuracies for cutoffs beyond those presented in the figure: Inagaki et al. (2013) reported the accuracy for cutoffs 4–13, Thombs (2008) reported the accuracy for cutoffs 1–10, Lambert et al. (2015) reported the accuracy for cutoffs 5, 9, 10, 15, 20, Hyphantis (2011) reported the accuracy for cutoffs 4–16, Osorio (2009) reported the accuracy for cutoffs 10–21. All the reported cutoffs were included while calculating the mean of reported cutoffs though they are not shown in the figure.

Figure 4 shows the pattern of reporting cutoffs for the EPDS; 5 studies had optimal cutoffs below 10, 13 between 10 and 13, and 1 greater than 13. Studies for which the EPDS was poorly sensitive at cutoff 10 (sensitivity: 0.43–0.73) (Bakare et al., 2014; Chaudron et al., 2010; Radoš et al., 2013; Thiagayson et al., 2013; Toreki et al., 2013) had optimal cutoffs that were less than 10 (mean of reported cutoffs: 9.9). Studies for which the EPDS was highly sensitive at cutoff 10 (sensitivity: 0.82–1.00) (Alvarado et al., 2015; Beck & Gable, 2001; Bunevicius et al., 2009; Couto et al., 2015; Garcia‐Esteve et al., 2003; Khalifa et al., 2015; Phillips et al., 2009; Rochat et al., 2013; Su et al., 2007; Tandon et al., 2012; Toreki et al., 2014; Vega‐Dienstmaier et al., 2002) had optimal cutoffs greater than 10. These studies tended to report more cutoffs above 10 than below 10 (mean of reported cutoffs: 11.8). All of these studies had optimal cutoffs between 10 and 13, with one exception: a study that reported accuracy only for cutoff 13 even though sensitivity was low at this cutoff (sensitivity: 0.35) (Pawlby et al., 2008).

FIGURE 4

Pattern of cutoff reporting for EPDS studies. Cells shaded in gray represent cutoff points for which diagnostic accuracy results are reported in the primary studies. “O” represents the optimal cutoff for EPDS explicitly stated in the studies except for Phillips (2009), which did not identify an optimal cutoff. For Phillips (2009), Youden's J optimal was calculated from published accuracies. For Bakare et al. (2014), Pawlby et al. (2008), and Beck (2001), only one cutoff was reported without stating whether it was optimal or not. Studies that reported accuracies for cutoffs beyond those presented in the figure: Khalifa et al. (2015) reported accuracy for cutoffs 1–15, Vega‐Dienstmaier et al. (2002) reported the accuracy for cutoffs 1–26. All the reported cutoffs were included while calculating the mean of reported cutoffs though they are not shown in the figure.

DISCUSSION

We compared bias in accuracy and selective cutoff reporting between the PHQ‐9, which has a clearly defined standard cutoff, and the EPDS, which does not, using IPD. Selective cutoff reporting was more pronounced for the PHQ‐9, and bias in estimated accuracy of published cutoffs compared to all cutoffs was similarly greater for the PHQ‐9. For the PHQ‐9, specificity estimates using the published dataset, which included results from published cutoffs only, were similar to those from meta‐analysis of the full dataset, which included results for all relevant cutoffs for all included studies; however, sensitivity was underestimated in the published dataset for cutoffs below 10, similar for the standard cutoff 10, and overestimated for cutoffs above 10. The cutoff reporting pattern in primary studies explains this pattern of under‐ and overestimation of sensitivity. Studies in which the PHQ‐9 was poorly sensitive but more specific identified cutoffs below 10 as optimal and reported more cutoffs below 10, whereas studies in which the PHQ‐9 was highly sensitive but less specific identified cutoffs above 10 as optimal and reported more cutoffs above 10. For the EPDS, compared to the full dataset, specificity estimates using the published dataset were similar across all cutoffs; however, sensitivity estimates were similar for cutoffs below 10 and for the most commonly reported cutoffs 10–13, but overestimated for cutoffs above 13. Unlike with the PHQ‐9, only primary studies in which the EPDS was highly sensitive at cutoff 10 reported more cutoffs above 10. Studies with poor sensitivity that reported optimal cutoffs below 10 reported results from cutoffs above 10 more often than comparable studies with the PHQ‐9. This may be because the PHQ‐9 has a single standard cutoff of 10, whereas EPDS studies are expected to report results for the commonly used cutoffs of 10–13.
The 2001 PHQ‐9 validation study, which included only 41 major depression cases, identified 10 as the standard cutoff (Kroenke et al., 2001; Spitzer et al., 1999). Similarly, the 1987 EPDS validation study, which included only 24 definite or probable major depression cases, suggested that cutoffs of 10 or 13 could be used (Cox et al., 1987). Consequently, most PHQ‐9 studies report accuracy for cutoff 10 but selectively report accuracy for cutoffs other than 10, depending upon the sensitivity at cutoff 10 (Levis et al., 2017; Moriarty et al., 2015). In the absence of a single standard cutoff, EPDS studies often report a range of cutoffs from 10 to 13 (Hewitt et al., 2009; O'Connor et al., 2016).
Only one previous study, an IPDMA with 13 studies (4589 participants, 1037 major depression cases), has examined selective cutoff reporting in screening instruments (for the PHQ‐9) (Levis et al., 2017). We replicated the analysis with a much larger sample (30 studies; 11,773 participants; 1587 cases) and found that although the reporting patterns were similar, the magnitude of bias was lower in the present study. In the previous study, when the cutoff increased from 9 to 10 and 10 to 11, the sensitivity also increased markedly, an impossible finding if all data are analyzed. In the present study, the sensitivity increased when the cutoff increased from 10 to 11, but the increment was minimal. The reduction in the magnitude of bias due to selective reporting compared to the previous study may be due to improved reporting practices over time. This could, however, also be a result of differences in inclusion criteria in the two studies. Of the 13 primary studies included in the previous study, six were excluded from the present study for one of the following reasons: selecting the sample based on existing distress or a mental health diagnosis, or from psychiatric settings; having >10% difference in sample size or MDD cases between the IPD and published datasets; or administering the PHQ‐9 and diagnostic interview more than 2 weeks apart.
Primary studies are often carried out to identify optimal cutoffs and explore accuracy of a screening tool in a specific population; regardless, the full range of cutoffs should be reported. According to STARD reporting guidelines, diagnostic accuracy estimates and precision, as well as the cross tabulation of the index test and the reference standard, should be reported (Bossuyt et al., 2015). The guideline should also recommend reporting accuracy estimates for all relevant cutoffs for ordinal index tests. Citation of the STARD guideline, however, was not common; only 2 of 49 PHQ‐9 and EPDS studies (Arroll et al., 2010; Sherina et al., 2012) cited it.
When data are missing from some cutoffs in primary studies, conventional meta‐analyses based on published cutoffs only may result in biased accuracy estimates. Accuracy estimates can be corrected in meta‐analyses using modelling techniques (Benedetti et al., 2020) or by doing IPDMA, which has some advantages, but is highly resource intensive (Cochrane methods: IPD meta‐analysis, 2020; Ioannidis et al., 2002; Riley et al., 2010; Stewart & Tierney, 2002).
The major strength of this study is that we compared two depression screening instruments with different characteristics using IPDMA. We explored how the presence of a clearly defined standard cutoff versus the absence of such a standard may be associated with bias in accuracy. A potential limitation is that we calculated the optimal cutoff based on Youden's J for the studies not specifying an optimal cutoff. Those studies may not have considered the cutoff that maximized Youden's J as optimal. However, Youden's J appears to be the most typical method of identifying optimal cutoff thresholds for depression screening measures. In the present study, 16 of 18 (89%) PHQ‐9 studies and 10 of 13 (77%) EPDS studies with multiple reported cutoffs that identified an optimal cutoff used Youden's J or identified an optimal cutoff that was equivalent to the Youden's J optimal cutoff. Another possible limitation is that we examined primary studies regardless of the reference standard that was used in each study. We have previously shown that different types of diagnostic interviews perform differently (Wu et al., 2021). We do not believe, however, that the reference standard used would have likely influenced decisions about which cutoffs to report in primary studies.
When studies appeared to report cutoffs selectively depending upon the sensitivity at the standard cutoff, synthesis of accuracy results from published cutoffs led to underestimation of sensitivity below the standard cutoff and overestimation of sensitivity above the standard cutoff. This phenomenon appears to be diluted for the EPDS, for which the standard cutoff is not clearly defined and there is a range of commonly used and reported cutoffs, because the primary studies tend to report a range of cutoffs around the true optimal cutoff. To reduce bias in evidence syntheses, researchers conducting primary studies should report accuracy estimates or a contingency table for all relevant cutoffs, or make their primary data available. Researchers who conduct meta‐analyses should use modelling approaches to overcome possible biases from selective cutoff reporting or should use an IPDMA approach.

CONFLICT OF INTEREST

All authors have completed the ICMJE uniform disclosure form and declare: no support from any organization for the submitted work; no financial relationships with any organizations that might have an interest in the submitted work in the previous 3 years, with the following exceptions: Dr. Tonelli declares that he has received a grant from Merck Canada, outside the submitted work. Dr. Vigod declares that she receives royalties from UpToDate, outside the submitted work. Dr. Beck declares that she receives royalties for her Postpartum Depression Screening Scale published by Western Psychological Services. Dr. Inagaki declares that he has received a grant from Novartis Pharma, and personal fees from Meiji, Mochida, Takeda, Novartis, Yoshitomi, Pfizer, Eisai, Otsuka, MSD, Technomics, and Sumitomo Dainippon, all outside of the submitted work. Dr. Ismail declares that she has received honoraria for speaker fees for educational lectures for Sanofi, Sunovion, Janssen and Novo Nordisk. All authors declare no other relationships or activities that could appear to have influenced the submitted work. No funder had any role in the design and conduct of the study; collection, management, analysis, and interpretation of the data; preparation, review, or approval of the manuscript; and decision to submit the manuscript for publication.

AUTHOR CONTRIBUTIONS

Dipika Neupane, Brooke Levis, Parash Mani Bhandari, Jill T. Boruff, Pim Cuijpers, Simon Gilbody, John P. A. Ioannidis, Lorie A. Kloda, Dean McMillan, Scott B. Patten, Ian Shrier, Roy C. Ziegelstein, Liane Comeau, Nicholas D. Mitchell, Marcello Tonelli, Simone N. Vigod, Brett D. Thombs and Andrea Benedetti contributed to conception and design of this study. Jill T. Boruff and Lorie A. Kloda designed and performed database searches for this study. Dickens H. Akena, Rubén Alvarado, Bruce Arroll, Muideen O. Bakare, Hamid R. Baradaran, Cheryl Tatano Beck, Charles H. Bombardier, Adomas Bunevicius, Gregory Carter, Marcos H. Chagas, Linda H. Chaudron, Rushina Cholera, Kerrie Clover, Yeates Conwell, Tiago Castro e Couto, Janneke M. de Man‐van Ginkel, Jaime Delgadillo, Jesse R. Fann, Nicolas Favez, Daniel Fung, Lluïsa Garcia‐Esteve, Bizu Gelaye, Felicity Goodyear‐Smith, Thomas Hyphantis, Masatoshi Inagaki, Khalida Ismail, Nathalie Jetté, Dina Sami Khalifa, Mohammad E. Khamseh, Jane Kohlhoff, Zoltán Kozinszky, Laima Kusminskas, Shen‐Ing Liu, Manote Lotrakul, Sonia R. Loureiro, Bernd Löwe, Sherina Mohd Sidik, Sandra Nakić Radoš, Flávia L. Osório, Susan J. Pawlby, Brian W. Pence, Tamsen J. Rochat, Alasdair G. Rooney, Deborah J. Sharp, Lesley Stafford, Kuan‐Pin Su, Sharon C. Sung, Meri Tadinac, S. Darius Tandon, Pavaani Thiagayson, Annamária Töreki, Anna Torres‐Giménez, Alyna Turner, Christina M. van der Feltz‐Cornelis, Johann M. Vega‐Dienstmaier, Paul A. Vöhringer, Jennifer White, Mary A. Whooley, Kirsty Winkley and Mitsuhiko Yamada contributed primary datasets to this study. Dipika Neupane, Brooke Levis, Parash Mani Bhandari, Ying Sun, Chen He, Yin Wu, Ankur Krishnan, Zelalem Negeri, Mahrukh Imran, Danielle B. Rice, Kira E. Riehm, Nazanin Saadat, Marleine Azar, Tatiana A. Sanchez, Matthew J. Chiovitti and Alexander W. Levis contributed to data extraction and coding for the meta‐analysis. Dipika Neupane, Brooke Levis, Parash Mani Bhandari, Brett D. Thombs and Andrea Benedetti contributed to data analysis and interpretation. Dipika Neupane, Brooke Levis, Parash Mani Bhandari, Brett D. Thombs and Andrea Benedetti contributed to drafting the manuscript. All authors provided a critical review and approved the final manuscript. Brett D. Thombs and Andrea Benedetti are the guarantors; they had full access to all the data in the study and take responsibility for the integrity of the data and the accuracy of the data analyses.

68 in total

1. Bias in sensitivity and specificity caused by data-driven selection of optimal cutoff values: mechanisms, magnitude, and solutions.

Authors: Mariska M G Leeflang; Karel G M Moons; Johannes B Reitsma; Aielko H Zwinderman
Journal: Clin Chem Date: 2008-02-07 Impact factor: 8.327

2. How reliable is depression screening in alcohol and drug users? A validation of brief and ultra-brief questionnaires.

Authors: Jaime Delgadillo; Scott Payne; Simon Gilbody; Christine Godfrey; Stuart Gore; Dawn Jessop; Veronica Dale
Journal: J Affect Disord Date: 2011-07-01 Impact factor: 4.839

3. The PHQ-9: validity of a brief depression severity measure.

Authors: K Kroenke; R L Spitzer; J B Williams
Journal: J Gen Intern Med Date: 2001-09 Impact factor: 5.128

4. Optimizing detection of major depression among patients with coronary artery disease using the patient health questionnaire: data from the heart and soul study.

Authors: Brett D Thombs; Roy C Ziegelstein; Mary A Whooley
Journal: J Gen Intern Med Date: 2008-09-25 Impact factor: 5.128

5. Validity of an interviewer-administered patient health questionnaire-9 to screen for depression in HIV-infected patients in Cameroon.

Authors: Brian W Pence; Bradley N Gaynes; Julius Atashili; Julie K O'Donnell; Gladys Tayong; Dmitry Kats; Rachel Whetten; Kathryn Whetten; Alfred K Njamnshi; Peter M Ndumbe
Journal: J Affect Disord Date: 2012-07-27 Impact factor: 4.839

6. Commentary: meta-analysis of individual participants' data in genetic epidemiology.

Authors: John P A Ioannidis; Philip S Rosenberg; James J Goedert; Thomas R O'Brien
Journal: Am J Epidemiol Date: 2002-08-01 Impact factor: 4.897

7. Postpartum depression screening at well-child visits: validity of a 2-question screen and the PHQ-9.

Authors: Dwenda Gjerdingen; Scott Crow; Patricia McGovern; Michael Miner; Bruce Center
Journal: Ann Fam Med Date: 2009 Jan-Feb Impact factor: 5.166

8. Comparative validity of three screening questionnaires for DSM-IV depressive disorders and physicians' diagnoses.

Authors: Bernd Löwe; Robert L Spitzer; Kerstin Gräfe; Kurt Kroenke; Andrea Quenter; Stephan Zipfel; Christine Buchholz; Steffen Witte; Wolfgang Herzog
Journal: J Affect Disord Date: 2004-02 Impact factor: 4.839

9. STARD 2015: an updated list of essential items for reporting diagnostic accuracy studies.

Authors: Patrick M Bossuyt; Johannes B Reitsma; David E Bruns; Constantine A Gatsonis; Paul P Glasziou; Les Irwig; Jeroen G Lijmer; David Moher; Drummond Rennie; Henrica C W de Vet; Herbert Y Kressel; Nader Rifai; Robert M Golub; Douglas G Altman; Lotty Hooft; Daniël A Korevaar; Jérémie F Cohen
Journal: BMJ Date: 2015-10-28

10. Accuracy of Patient Health Questionnaire-9 (PHQ-9) for screening to detect major depression: individual participant data meta-analysis.

Authors: Brooke Levis; Andrea Benedetti; Brett D Thombs
Journal: BMJ Date: 2019-04-09
