
Can short PROMs support valid factor-based sub-scores? Example of COMQ-12 in chronic otitis media.

Bojana Bukurov^1,2, Mark Haggard^3, Helen Spencer^4, Nenad Arsovic^1,2, Sandra Sipetic Grujicic^1,5.

Abstract

PURPOSE: Interpretable factor solutions for questionnaire instruments are typically taken as justification for use of factor-based sub-scores. They can indeed articulate content and construct validities of a total and components but do not guarantee criterion validity for clinical application. Our previous documentation of basic psychometric characteristics for a 12-item patient-reported outcome measure in adult chronic otitis media (COMQ-12) justified next appraising criterion validity of sub-scores.
METHODS: On 246 cases at 1st clinic visit, we compared various classes of factor solution, concentrating on the best-fitting 3-factor ones as widely supported. Clinical data offered two independent measures as external criteria: binaural hearing (audiometric thresholds) for evaluating the 'Hearing' sub-score, and oto-microscopic findings for the 'Ear discharge symptoms' sub-score. As criterion for the total, and for the semi-generic 'Activities/healthcare' sub-score, the generic Short Form-36 item set offered a widely used multi-item criterion measure.
RESULTS: Factor model fit and parsimony again favoured a 3-factor solution for COMQ-12; however insufficient item support and the dominant 1st principal component of variation made sub-scoring problematic. The best solution was bi-factor, from which only the weighted total score met the declared convergent validity standard of r = 0.50. Two of the more specific sub-scores ('Ear discharge symptoms' and 'Hearing') correlated poorly with clinical findings and weighted binaural hearing thresholds.
CONCLUSION: The COMQ-12 total is acceptably content-valid for general clinical purposes, but the small item set, reflecting excessive pressure for brevity in clinical application, does not well support three criterion-valid factor-based scores. This distinction should be made explicit, and profile sub-scoring discouraged until good convergent and furthermore divergent criterion validities are shown.

Year: 2022 PMID: 36174001 PMCID: PMC9522295 DOI: 10.1371/journal.pone.0274513

Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.752

Background

Chronic otitis media (COM) is a bothersome condition with implications for health-related quality of life (HRQoL), mostly due to embarrassing ear discharge and disabling hearing loss [1, 2]. Usually originating in childhood, it is more prevalent in certain sub-populations, and low- and middle-income countries [3]. Surgery remains the treatment of choice in COM patients and is usually directed at eradicating the disease and/or reducing hearing disability. In active forms of the disease, surgery is considered necessary for preventing or arresting complications [4]. A balanced, informed decision for surgery requires a multi-aspect appraisal to be communicated to the patient for shared understanding and realistic expectations of treatment gains in COM’s direct impacts, and in generic HRQoL [5]. The Chronic Otitis Media Questionnaire-12 (COMQ-12) is a short questionnaire covering impacts of COM on HRQoL. It was adapted from items in three questionnaires previously used [6-8]. After the originating publication [9], further information on item scoring, reliability and factor structure came from Phase 1 of the present project [10]. COMQ-12 is largely oriented to direct pathology-linked consequences of COM, so is semi-specific in content. It is self-administered, with 12 questions (7 related to symptoms, 2 to daily activities, 2 to healthcare uptake and 1 overall QoL), each rated on a 6-point numerical scale (0 to 5). The total raw score thus ranges from 0 up to 60 (worst possible HRQoL). Most published COMQ-12 articles report translations (10 other languages to date) and useful small local-language reference samples [11-19]. Seeking fuller criterion validation might be worthwhile, given factor content validity and the results from two limited validation paradigms: non-nullity for the normal/abnormal (‘known-groups’) difference as minimal construct validity, and correlations with single global judgements (visual analogue scale, VAS) as preliminary face validation.
The usefulness of all patient-reported outcome measures (PROMs) rests ultimately on the rigor of their validation, clarifying what they actually measure. Despite a multi-centre overview [20], the potential of COMQ-12 to support profiling via its sub-scores has not yet been critically and rigorously addressed. Proceeding to criterion validation for sub-scores pre-supposes their basing in a strongly interpretable and well-fitting factor solution; however even some of the better studies on COMQ-12 have not formally compared relative or absolute goodness-of-fit indices to select most appropriate solutions [21]. The strong 1st principal component (1st PC) of variation for COMQ-12 and the accompanying weak factor structure, had warned early of probably low divergent validities, so of problems for justifiable profiling [9]. For factor model interpretation, we had to clarify also the close relation seen between two of the sub-scores (the 2-item ‘Ear discharge symptoms’ score and 4-item ‘Activities/healthcare’), which fuse in 2-factor solutions [10, 21]. This article therefore formally documents factor solution fits, justifying choice of solution, and proceeds to criterion validation. In this, the expected and confirmed good fit for bi-factor models required that bi-factor, as well as simple CFA, be considered [22]. More generally, we have followed others’ methodological recommendations for attention to long-neglected method issues of power and effect size in otolaryngology, the source discipline [23], also attending to biases and missing data [24] and have further tested metrical assumptions in the item-scoring used.

Methods

Participants and data acquired

The Ethical Committee of the School of Medicine, University of Belgrade approved this 2-phase study (decision number 29/II-1), and all patients provided written informed consent. Consecutive adult patients diagnosed with COM (N = 246) were enrolled at a tertiary referral centre over 13 months. Patients filled two questionnaires at each visit, at baseline (1st visit, V1); the role of a 2nd visit (V2) for replicate baseline data on 60 patients [10] and variability reduction is further clarified in the S1 Appendix. Post-operative data at 6 months and 1 year are not used here. Supplementary demographic data included age, gender, level of education and distance from the capital where the tertiary centre is located. The auditory data available for validating the ‘hearing’ sub-score included mean air- and bone-conduction thresholds in decibels (Pure Tone Averages, PTAs in dB), for affected and unaffected ears, measured at 0.5, 1, 2 and 4 kHz as recommended [25]. Clinical examination gave details of the disease (onset, duration, form, laterality, previous operations, if any, etc.) before surgery, and the type of operation performed. Oto-microscopic findings underpinned the disease activity assessment (here dichotomised as active/inactive) for validation of the ‘Ear discharge symptoms’ sub-score.
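The pure-tone average described above is a plain arithmetic mean over the four named frequencies. A minimal sketch follows; the function name and the example thresholds are hypothetical and are not taken from the study data.

```python
from statistics import mean

# Frequencies (kHz) named in the text for the pure-tone average (PTA).
PTA_FREQUENCIES_KHZ = (0.5, 1, 2, 4)


def pure_tone_average(thresholds_db):
    """Return the mean threshold in dB over the four PTA frequencies.

    thresholds_db maps frequency in kHz to the measured threshold in dB.
    """
    missing = [f for f in PTA_FREQUENCIES_KHZ if f not in thresholds_db]
    if missing:
        raise ValueError(f"missing thresholds at {missing} kHz")
    return mean(thresholds_db[f] for f in PTA_FREQUENCIES_KHZ)


# Hypothetical air-conduction thresholds for one ear (illustrative only).
example_ear = {0.5: 50, 1: 55, 2: 60, 4: 55}
print(pure_tone_average(example_ear))
```

The same function would apply to air- or bone-conduction thresholds, and to the affected or unaffected ear, since only the input values change.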

Measures used

Previous publications on COM and COMQ-12 had suggested that extracting 3 sub-scores was slightly more justifiable than 2: (one around ‘Hearing’, one on activities and healthcare uptake, here called ‘Activities/healthcare’ and one on disease status, here called ‘Ear discharge symptoms’) [10, 20]. Since COMQ-12 total attempts semi-generic meaning by aggregation and the ‘Activities/healthcare’ sub-score has semi-generic aspects, we opted for a reputable large generic item set as reliable criterion measure for these two measures. Thus, the second questionnaire used was the translated Serbian version of the widely used HRQoL questionnaire, Short Form-36 (SF-36) [26, 27]. Standard SF-36 scoring modes have several noted deficiencies (see S4 Appendix). To optimise generic nature of an SF-36-based criterion measure item use, we maximised reliability of the adopted criterion measure by using the simplest formulation of a consistency-weighted total, the 1st PC of all SF-36 items. As expected for an item set emphasising physical mobility, the 1st PC form of total score was highly negatively skewed (after scaling, -0.964, with SE 0.155), and was correspondingly non-linearly related to other measures. SF-36 has no natural scale, so to avoid underestimation and false-negative results, we transformed the SF-36 1st PC with inverse natural logarithm (additive constant 1.4 within the bracket); this reduced skewness to near zero (-0.021; SE = 0.155). Inversion also turns positive the typically negative correlations of QoL with disease measures, making expectation for all correlations positive. This transformation also linearised the two hypothesised validity correlations with SF-36 (of COMQ-12 total, and ‘Activities/healthcare’ respectively, by 0.022 (up to 0.547) and by 0.034 (up to 0.419)). To maximise metrical precision, and to establish generality [28], we explored three approaches in item scoring (see S2 Appendix) for continuous measurement. 
For both COMQ-12 and SF-36, the missing-data rate was low; further details on imputation, data-reduction and scoring are given in S2 and S4 Appendices.

Statistical strategy, power and criterion measures

To restrict multiple testing issues, we hypothesised only four a priori convergent validity correlations. For COMQ-12 total score and the ‘Activities/healthcare’ sub-score, we adopted as criterion measure the 1st PC of SF-36 items, and declared r ≥ 0.50 and r ≥ 0.40 a priori as acceptable values respectively. The power scenario was not an a priori sample size calculation, but for r = 0.40 and N = 246, power against nullity (at alpha 0.001, 2-tail) is very high at (1-β) = 0.9995, ie almost 100% power. For the ‘Ear discharge symptoms’ sub-score, the criterion assessment was dichotomised as active/inactive on the independently acquired clinical findings (oto-microscopy) with declared acceptable value r ≥ 0.40. For COMQ-12 ‘Hearing’ sub-score, the criterion was the mean binaurally weighted air-conduction hearing threshold from pure-tone audiometry, with declared acceptable value r ≥ 0.50. The binaural weighting is needed to rationally and empirically optimise the differing contributions of the two ears seen in asymmetrical hearing loss [29] (see S3 Appendix). For the main regular statistical procedures (e.g., correlations, t-tests, GLM (multiple regression), exploratory factor analysis (EFA), and Fisher tests for correlation differences [30]), we used SPSS (Version 26.0, SPSS Inc, Chicago, Illinois). The normality of raw descriptives and of model residuals was inspected visually and numerically for skewness and kurtosis as preliminary and in reaction to deficient multivariate normality, as addressed later. Confirmatory factor analyses (CFAs) included bi-factor modelling with the maximum-likelihood estimator in SPSS-AMOS 26, and we followed contemporary standards for reporting and interpreting modelling results with the commonly used goodness-of-fit indices [31-33]. In CFA the ‘factors’ are technically latent variables, but we retain the 1-word more widely understood term.
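The stated power figure can be checked approximately with the Fisher z method. The sketch below is an illustration rather than the authors' calculation: it assumes the standard normal approximation for a transformed correlation with standard error 1/sqrt(N - 3), and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def power_for_correlation(r, n, alpha, two_tailed=True):
    """Approximate power to reject rho = 0 when the true correlation is r.

    Uses the Fisher z approximation: atanh(r_hat) is roughly normal with
    mean atanh(r) and standard error 1 / sqrt(n - 3).
    """
    std_normal = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    z_crit = std_normal.inv_cdf(1 - tail)
    shift = math.atanh(r) * math.sqrt(n - 3)
    power = 1 - std_normal.cdf(z_crit - shift)
    if two_tailed:
        # Probability of rejecting in the wrong direction (negligible here).
        power += std_normal.cdf(-z_crit - shift)
    return power


power = power_for_correlation(r=0.40, n=246, alpha=0.001)
print(round(power, 4))
```

Under this approximation the result agrees with the reported value of about 0.9995 for r = 0.40, N = 246 and two-tailed alpha of 0.001.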

Results

The descriptives in Table 1 show the sample’s main clinical and demographic characteristics. The 246 patients were mostly not highly educated, with predominantly longstanding and active disease, and a wide symmetrical age distribution with mean (M) 41.61 years and Standard Deviation (SD) 15.73. The raw total of COMQ-12 items, marking the general sample severity, had a usefully symmetrical distribution, with M 25.41 and SD 11.16. After assessment, 41 patients were treated conservatively and of the 205 (83%) with surgical treatment recommended, 167 accepted. Hearing thresholds by air conduction showed moderate conductive hearing loss on the (more) affected ear, with mean air conduction (PTA) 54.87 decibels (dB), SD 18.99 and bone conduction 27.18 dB (SD 13.97). For the less affected ear, values were only mildly impaired: air PTA 32.21 dB (SD 18.19) and bone PTA 22.12 (10.91) respectively.

Table 1

Basic descriptives of sample on main demographic and clinical variables.

Descriptor		N (%)
Distance of residence:	Up to 100 km	135 (54.9)
(from Belgrade)	More than 100 km	91 (37.0)
	Missing	20 (8.1)
Education:	Primary, Lower secondary, Missing	129 (52.4)
	Upper secondary, post-secondary, 1st stage tertiary	117 (47.6)
Disease activity stage:	Inactive	81 (32.9)
	Active Mucosal	71 (28.9)
	Active Squamous	94 (38.2)
Duration of disease:	1–8 years	99 (40.2)
	8–32 years	137 (55.7)
	Missing	10 (4.1)

Of the 246 patients, 47.2% were male; 72.4% had unilateral and 27.6% bilateral disease. Stated duration of disease was initially coded logarithmically: 1–2, 2–4, 4–8 years etc. but for simplicity, was dichotomised (as here) at nearest-to-median boundary.

Preliminary EFA solutions

Unscaled Kaiser-Meyer-Olkin value of 0.767 and Bartlett Chi-sq value of 945.26, df = 66, p<0.001, showed that factor-analysis was justified. The 1st PC again dominated, explaining 34.41% of variance. To avoid a large detailed 3x3 (scoring basis, factor structure) tabulation, we first scoped factor structure issues on 2-Factor (2-F), 3-F and 4-F solutions in exploratory factor analyses (EFA rotated Varimax), based on the raw item scoring (ie numerical rating). This gave respectively solution Rsq values (unscaled) of 0.472, 0.568 and 0.645, with respective rotated last-extracted eigenvalues (LREV) of 2.77, 1.67 and 1.69. As expected, 2-F gave a simple but not very useful solution interpretable as ‘Hearing’ versus the rest, whilst 4-F separated the ‘Activities/healthcare’ items, attracting item 6 (dizziness) to the former; this is unsatisfactory, and reflects overfitting in EFA when an entire item set is forced into use. For CFA, we therefore focussed on variants of 3-F as the preferred structure a priori from the literature and on scaled versions of items (details in S2 Appendix).

Formal comparison of most relevant factor solutions using CFA

The EFA results seeded the CFAs, followed by deletion of links for lowest item loadings (as standardised regression weight, SRW). The minimum retained link SRW in the simple CFA was 0.16, for Q6 on the ‘Ear discharge symptoms’ factor. Both simple CFA and bi-factor CFA were considered, because in a dataset with strong 1st PC, bi-factor generally improves both fit and the capture of fine structure. In the 3-F simple CFA model, two cross-loading items, question 6 (‘dizziness’) and 12 (‘ear problems get you down?’) were retained with respective links (to ‘Hearing’ also ‘Ear discharge symptoms’, and to ‘Hearing’ also ‘Activities/healthcare’), because their deletion degraded the model fit, and so would erode sub-score support for profiling. All three inter-factor correlations (IFCs) were retained in simple CFA, the lowest being r = 0.338, confirming remaining factor interdependence in CFA. Item contents and SRW loadings in simple CFA for 2-F and 3-F solutions are given in Table 2 (upper two fields), including the retained cross-loadings. As in EFA, the ‘extra’ factor in 3-F splits the second factor from 2-F into ‘Ear discharge symptoms’ and ‘Activities/healthcare’. The 3-F simple CFA gives fair, not excellent, fit index values (Chi sq = 138.145, DF = 49, RMSEA = 0.086; AIC = 220.145, AIC saturated = 180, delta AIC = 40.145; CFI = 0.906). Qualitatively, this pattern differs little from the previously published simple CFA for Spanish COMQ-12 [21] and it approaches an earlier standard of RMSEA 0.08. Its falling short of the current standard of excellence (expressed as RMSEA < 0.05) is most likely due to the paucity and some inhomogeneity of items [34]. The fit for the simple 2-F CFA solution (Chi sq = 243.147, DF = 51, RMSEA = 0.124, CFI = 0.797; AIC = 321.15, AIC saturated = 180, delta AIC = 141.147) was not acceptable, with parsimony-adjusted delta AIC poorer by about 100 than the one for 3-F.
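For readers wishing to check the reported RMSEA values, they follow from the chi-square, degrees of freedom and sample size. The sketch below assumes the common point-estimate formula with N - 1 in the denominator; some software uses N instead, which gives the same values to three decimals here. The function name is hypothetical.

```python
import math


def rmsea(chi_sq, df, n):
    """RMSEA point estimate: sqrt(max(chi_sq - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))


N = 246  # sample size reported in the text

# Chi-square and degrees of freedom as reported for the two simple CFA models.
rmsea_3f = rmsea(chi_sq=138.145, df=49, n=N)
rmsea_2f = rmsea(chi_sq=243.147, df=51, n=N)

print(round(rmsea_3f, 3), round(rmsea_2f, 3))
```

Both values round to the figures reported above (0.086 for the 3-F and 0.124 for the 2-F simple CFA).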

Table 2

COMQ-12 item loadings expressed as standardised regression weights (SRWs) for three factor structure models in CFA.

Item →	1	2	3	4	5	6	7	8	9	10	11	12
Solution & factor label ↓
Simple 2-F
F1 ‘Hearing’			0.830	0.882	0.538	0.244	0.500					0.355
F2 ‘Activities, healthcare plus Ear discharge symptoms’	0.483	0.419				0.222		0.501	0.434	0.725	0.803	0.395
Simple 3-F
F1 ‘Hearing’			0.826	0.879	0.543	0.293	0.506					0.395
F2 ‘Activities/ healthcare’								0.475	0.411	0.770	0.847	0.344
F3 ‘Ear discharge symptoms’	0.855	0.757				0.160
Bifactor 3-F
General	0.390	0.292	0.611	0.681	0.642	0.477	0.629	0.346		0.292	0.358	0.618
F1 ‘Hearing’			0.616	0.555		-0.085
F2 ‘Activities/ healthcare’								0.327	0.386	0.725	0.760	0.227
F3 ‘Ear discharge symptoms’	0.764	0.695

Abbreviated keywords for questionnaire items: 1 Ear drainage; 2 Smelly ear; 3 Hearing at home; 4 Hearing in noise; 5 Ear discomfort; 6 Dizziness; 7 Tinnitus; 8 Restricted activities; 9 Unable to wash; 10 General practitioner visits for ear problems; 11 Use of medicines for ear problems; 12 Ear problems ‘get you down’. Double-row entries per column represent cross-loading for simple versions of CFA, but for bi-factor analysis the 9 out of 12 dual instances represent dual loadings on the general and one specific factor. The bi-factor general link to item 9 had to be suppressed to enable the model to run, and two weak loadings from any specific factor to Q5 or to Q7 likewise. Of three low loadings (SRW<0.15) considered for dropping for simplicity, ‘ear discharge symptoms’ to Q6 and ‘hearing’ to Q12 were then dropped. The third, although low, ‘hearing’ to Q6, had to be retained to enable the model to run, the low or negative sign making it a contrasting anchor not to be considered part of hearing disability. In the bi-factor model, ‘Hearing’ and ‘Ear discharge symptoms’ both become under-sampled, with only 2 strongly loading items.

The bi-factor (3-F plus general factor) solution in Fig 1 and the lowest field of Table 2 approached excellent fit (Chi sq = 90.048, DF = 44, RMSEA = 0.065, CFI = 0.951). The bi-factor solution also achieved remarkably high parsimony with AIC ‘default’ = 182.048, AIC saturated = 180.0 (delta AIC = 2.048). For readers unfamiliar with modelling indices and their properties, this means roughly that the bi-factor fit is about as good as could be reasonably expected, given its high parsimony. Therefore, further statements refer to this model and multivariate normality was examined for the preferred solution. As there was a slight kurtotic infringement of multivariate normality (Mardia multivariate kurtosis 6.39, CR 2.74) [35], we bootstrapped the model 1000 times for re-sampled ‘conservative’, ie distribution-free, p-values.
There were only two items with marginal p-values under normality assumptions, in the ‘Hearing’ factor, and these were also highly kurtotic. Taking the cross-loading item 6 with SRW = 0.085 as reference, we accepted the retention of these links based on their bootstrapped p-values of 0.00039 and 0.00050. The other 10 COMQ-12 items all had loadings (SRW) above 0.35, many above 0.5, with the exception of one general factor link at 0.23. All bootstrapped p-values undercut p = 0.005 for specific links, and 0.025 for general factor links. Thus, multivariate kurtosis is not a concern for the acceptance of model links. This summary refers to the better-fitting, bi-factor, model but consistent comparisons held for simple CFA.
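The AIC comparison across the three models can be reconstructed from the reported chi-square and degrees of freedom. This sketch assumes the convention AIC = chi-square + 2q, with q the number of free parameters, and assumes 90 distinct sample moments for 12 items (78 variances and covariances plus 12 means). That count is inferred from the reported saturated AIC of 180 and is not stated in the text; the names used are hypothetical.

```python
# ASSUMPTION: 90 sample moments, inferred from AIC saturated = 180 = 2 * 90.
N_MOMENTS = 90


def aic_from_fit(chi_sq, df, n_moments=N_MOMENTS):
    """AIC = chi_sq + 2 * q, where q = n_moments - df free parameters."""
    n_params = n_moments - df
    return chi_sq + 2 * n_params


# (chi-square, degrees of freedom) as reported for each model.
models = {
    "simple 2-F": (243.147, 51),
    "simple 3-F": (138.145, 49),
    "bi-factor 3-F": (90.048, 44),
}

aic_saturated = 2 * N_MOMENTS  # a saturated model has chi-square 0 and df 0
delta_aic = {
    name: aic_from_fit(chi_sq, df) - aic_saturated
    for name, (chi_sq, df) in models.items()
}

for name, delta in delta_aic.items():
    print(f"{name}: delta AIC = {delta:.3f}")
```

Under these assumptions the deltas match the reported 141.147, 40.145 and 2.048, which shows how closely the bi-factor model approaches the saturated reference despite using fewer parameters.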

Fig 1

Simplified graphic of CFA for Bi-factor solution to COMQ-12 items.

An IFC or alternatively regression link (SRW = 0.273), between factors 2 and 3 (ie ‘Ear discharge symptoms’ and ‘Activities/healthcare’) was necessary to permit convergence of model estimates, recalling the competitiveness of 2-F EFA solutions. However, a 2-F bi-factor model had to be rejected structurally; an illogical sign reversal between the above two factors made the resulting scores unsuitable for interpretation and clinical application.

Convergent and divergent validation of total and sub-scores

Table 3 summarises as Pearson correlations the criterion validities of COMQ-12 total and sub-scores, according to simple CFA and bi-factor models. Emboldening represents the prediction and requirement for highest correlations in their row, in showing convergent validity. Obtained emboldened correlations are indeed mostly the highest, but they are modest overall, and neither solution produces r>0.4 for ‘Ear discharge symptoms’. For defining a total score, bi-factor versus simple solution makes little difference to the validity correlation values with SF-36; both exceed declared acceptability cut-off for total. The bi-factor solution is necessary to allow correlation of ‘Activities/healthcare’ with SF-36 to exceed the declared cut-off.

Table 3

Pearson correlations (with 95% CIs) between external criterion variables and COMQ-12 total and specific factor scores.

External criterion variable →	Otomicroscopic findings	Weighted binaural aPTA	SF-36 1st PC total
QUESTIONNAIRE VARIABLE, & Version of COMQ-12 score ↓	Active vs. Inactive	Transformed	Transformed
TOTAL 1st PC total	0.207 (0.084, 0.323)	0.233 (0.111,0.348)	0.547 (0.453, 0.629)
TOTAL Bi-factor general	0.167 (0.043, 0.287)	0.247 (0.126, 0.361)	0.566 (0.474, 0.645)
HEARING Simple CFA	0.177 (0.053, 0.295)	0.315 (0.198, 0.424)	0.456 (0.350, 0.549)
HEARING Bi-factor	0.112 (-0.013, 0.234)	0.251 (0.130, 0.365)	0.043 (-0.082, 0.167)
ACTIVITIES/ HEALTHCARE Simple CFA	0.122 (-0.003, 0.244)	0.125 (0.0002, 0.247)	0.419 (0.310, 0.517)
ACTIVITIES/HEALTHCARE Bi-factor	0.058 (-0.068, 0.181)	0.023 (-0.102, 0.148)	0.220 (0.098, 0.336)
EAR DISCHARGE SYMPTOMS Simple CFA	0.357 (0.242, 0.461)	0.130 (0.005, 0.251)	0.224 (0.102, 0.340)
EAR DISCHARGE SYMPTOMS Bi-factor	0.335 (0.219, 0.442)	0.006 (-0.119, 0.131)	-0.014 (-0.139, 0.111)

All variables use averages of Visit 1 and 2 data for reliability reasons (except for SF-36, see text and S1 Appendix). Correlations with the declared criterion measures (heads of columns) are emboldened. Weighted binaural auditory thresholds (air-conduction PTA) are the saved predicted values from the respectively most appropriate and fair binaurally weighted threshold models as specified in S3 Appendix. The SF-36 values denoting HRQoL have been inverted and transformed in normalising the SF-36 distribution (S4 Appendix). For bi-factor values only (so using the lower row for each field), six formal comparisons of correlation difference for documenting divergent validity were made between bold and non-bold entries using Fisher’s Z in SPSS 26 [36]. Disregarding the totals and descending the two non-bold entries in each column in descending order of the other rows, these correlation magnitudes differed from the respective (ie not predicted bi-factor) variables’ emboldened values as follows. For Col 1: (Z = 2.400, p = 0.017; Z = 3.775, p<<0.001), for Col 2: (Z = 2.4444, p = 0.015; Z = 2.54, p = 0.011), and for Col 3: (Z = 1.894, p = 0.058; Z = 3.124, p = 0.002). We replicated this set of tests with the t-based Williams test [28], and only one difference in p greater than 0.0005 in the 4th decimal place of the p-value was obtained.

Assessing divergent validities under bi-factor solution demands six tests for correlation difference (See Table 3 footnote for Fisher Z and p-values). Divergent validity is shown when the non-emboldened correlations in the same row are lower than the corresponding emboldened ones. There is no general convention for size of required difference, and PROM development rarely gets this far; we adopted correlation magnitude differences of ≥0.300 as good, and ≥0.200 as fair.
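The 95% confidence intervals in Table 3 appear consistent with the standard Fisher z interval for a Pearson correlation at N = 246. The sketch below shows that construction; it is an illustration and not necessarily the exact procedure the authors ran, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a Pearson correlation via the Fisher z transform."""
    z = math.atanh(r)
    se = 1 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Two entries taken from Table 3, with N = 246.
ci_total_sf36 = fisher_ci(0.547, 246)      # 1st PC total vs SF-36
ci_discharge_oto = fisher_ci(0.335, 246)   # bi-factor 'Ear discharge symptoms' vs oto-microscopy

print([round(x, 3) for x in ci_total_sf36])
print([round(x, 3) for x in ci_discharge_oto])
```

These reproduce the tabulated intervals (0.453, 0.629) and (0.219, 0.442) to three decimals. The tests of correlation difference in the footnote involve correlations measured on the same patients, so they are not reproduced by this simple interval.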
For bi-factor sub-scores, in Column 1 the 0.335 at the bottom left differs from the two other bi-factor rows’ entries by > 0.300 (good divergence), and the significance tests in the footnote generalise this example. Simple CFA does not achieve that. Improved divergent validity with bi-factor is expected from its co-extraction of a general factor, central to bi-factor purpose and usefulness. Indeed, under bi-factor solution, the required correlation difference for ‘hearing’ exceeds 0.200 in one instance of PROM variable not required for convergent validity (vs generic SF-36) but not the other (vs ‘Ear discharge symptoms’). The achievement of modest divergent validity under bi-factor also brings lowered convergent validity as might be expected (r at 0.251 for bi-factor ‘Hearing’ with binaural threshold), well below the declared cut-off 0.50. For ‘Activities/healthcare’ a similar competition or trade-off is seen between the two classes of validity, one which bi-factor solutions clarify. In summary, whilst some instances of divergent validity are also present and are slightly favoured by the best (bi-factor) solution as expected, the divergence is patchy across measures and comes at the expense of convergent validity.
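The divergence grading can be applied mechanically to the bi-factor rows of Table 3. The sketch below takes each sub-score's emboldened (criterion) correlation and compares its magnitude with the two other correlations in the same row, using the ≥0.300 (good) and ≥0.200 (fair) thresholds from the text. The label "insufficient" for smaller differences is a hypothetical name introduced here, as are the function and variable names.

```python
# Bi-factor rows of Table 3: criterion correlation, then the two
# non-criterion correlations in the same row.
BIFACTOR_ROWS = {
    "Hearing": (0.251, (0.112, 0.043)),                  # others: oto-microscopy, SF-36
    "Activities/healthcare": (0.220, (0.058, 0.023)),    # others: oto-microscopy, aPTA
    "Ear discharge symptoms": (0.335, (0.006, -0.014)),  # others: aPTA, SF-36
}


def grade_divergence(convergent, other):
    """Grade the difference in correlation magnitude against the stated cut-offs."""
    diff = abs(convergent) - abs(other)
    if diff >= 0.300:
        return "good"
    if diff >= 0.200:
        return "fair"
    return "insufficient"  # hypothetical label, not used in the source


grades = {
    name: [grade_divergence(convergent, other) for other in others]
    for name, (convergent, others) in BIFACTOR_ROWS.items()
}

for name, result in grades.items():
    print(name, result)
```

On these values only ‘Ear discharge symptoms’ reaches good divergence on both comparisons, ‘Hearing’ reaches fair on one (vs SF-36), and ‘Activities/healthcare’ reaches neither, matching the patchy pattern described above.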

Discussion

Factor structure comparisons as pre-requisite to validation

Rigorous initial psychometric approaches to reliability, model fit and validity avoid wasted effort at the later stage of evaluation for clinical applicability, so do not conflict with diverse clinimetric ideals, even if delays for methodology frustrate clinicians [37]. From preceding studies and content validity considerations [10, 20], a solution with 3 relatively specific sub-scores was always promising for the 12-item set COMQ-12. However, the separate and important issue of what sub-scores really measure and whether 12 items could sample three obtained factors adequately for profiling had not previously been addressed. The best fit is produced by bi-factor solutions [31, 32] which suggests that sub-score profiling is not well supported in COMQ-12. On a small item set, bi-factor inevitably raises stability issues, and especially so when the item support for the structurally optimal number of factors becomes slight and not evenly spread. We showed this as a problem already with COMQ-12, even before the extra demands from the bi-factor solution. The contrasting model classes raise the issue of what admixtures of a generic construct versus common method bias [38] are reflected in 1st PC versus the bi-factor solution’s general factor. Here the only model with truly good fit was bi-factor, making it the default preferred structure for addressing both convergent and divergent validity. Bi-factor Root Mean Square Error Approximation (RMSEA) just failed to undercut the new excellence standard of ≤0.05 [33, 34], but the more stable Comparative Fit Index (CFI>0.950) met standard. The normalisation (in effect similar to Rsq) within the CFI makes CFI more stable than RMSEA across extreme conditions (such as having few fitted variables), hence perhaps more appropriate as a general guide. 
We confirmed expectation that bi-factor’s better descriptive account would also favour divergent validity by removing covariation with non-specific items referring to tinnitus and dizziness. The results for ‘Hearing’ indeed achieved this, but unfortunately the consequent reliability loss in the fewer items remaining with (high) loadings on the factors then compromised convergent validity. For both simple and bi-factor versions of a ‘Hearing’ sub-score in this data set, the validity correlations were low, even compared to the typically modest classical magnitudes of correlation found when audiometric hearing thresholds are used to predict reported hearing ability [39, 40]. For the ‘Ear discharge symptoms’ sub-score, the declared value for convergent criterion validity was also not met. Favourable recommendations for use at this point have to be restricted to the COMQ-12 total, and the sub-score ‘Activities/healthcare’, which have shown acceptable convergent criterion validity and for the latter some divergent validity.

General usefulness of bi-factor structures

The bi-factor general factor does not differ greatly from the 1st PC in this sample (r = 0.921), despite some items having low loading on it (six of twelve below 0.4, Table 2) so it can here be interpreted similarly as a weighted total. Bi-factor separation of generic components (including correlated response biases) from specific components offers widespread advantages in questionnaire measurement, and in health this distinction is natural and fundamental. Bi-factor solutions should also more flexibly serve a multiplicity of explicit measurement aims within the application-centred Clinimetric approach [37]. A recent well-executed Sino-Swiss study on a similar COM semi-generic QoL questionnaire used the bi-factor technique with a larger item set (M = 21) on an adequate sample size (N = 208) [22]. Those authors did not formally contrast solutions eg bi-factor against simple CFA, and only undertook preliminary forms of validation. However, their bi-factor fit, as good and interpretable as ours, adds support for a bi-factor approach to PROMs in COM. Their not reporting some of the item sampling problems noted here probably reflects the model support of 4 specific (non-general) profiling factors from 21 items (average ratio 5.25), consistent with our suggestion that, together with the imbalance towards ‘Hearing’ in simple CFA, 3 factors from 12 (ratio 4.0) for COMQ-12 is too thin a spread to permit sub-score profiling.

Neglect of divergent validity

For a total score, the lack of desired higher correlation with its criterion measure (than for the correlations with criterion measures adopted for sub-scores composing it) is inevitable and perhaps not too serious. This might explain the inattention in the literature to this more challenging form of validation, due to a wish to avoid discouragement through lack of divergence, in turn explaining the lack of development work on how divergent validity can be improved. Where the preliminary scientific goal is mere non-nullity, we see a major scientific threat of un-usable, uninterpretable, or ungeneralizable ‘positive’ findings, widely interpretable as publication bias [41]. The larger sample size needed for showing divergent validity via differences among modest correlations is one challenge. The pervasive item inter-correlation due to common method bias and response stereotypy in questionnaires is another, which without computer administration (ie making the items not co-visible) is hard to address. There is a pervasive bias towards high correlations when a single responding method is used [42], so high predicted (convergent) correlations may exaggerate the impression of a strong measure, unless a critical and analytical set of principles is applied, embracing bi-factor analyses and divergent validity. These principles seem not to have been heeded despite multi-disciplinary evidence: a decade ago, a similar example in orthopaedic PROMs reported low divergent validity of sub-scores from an otherwise ‘promising’ questionnaire [43].

Strengths, limitations, and further research

The present work from the largest single-country COMQ-12 sample to date is the first to have explicitly and quantitatively compared 2-F with 3-F CFA, and simple with bi-factor CFA, also to address criterion validity. For both reliability and clinical relevance, we only proceeded to criterion validation after explicitly justified model choice. To avoid other obstacles to application, we also necessarily addressed linearity in two of the validity relationships, and handled skew in the same transformation, giving consequently a fairer, but also stronger, assessment of correlation magnitudes. As a limitation, the under-sampling of content factors also limits the strength of methodological comparisons. We have avoided confirmation bias, critically and impartially examining a now widely translated and apparently used instrument. Given these rigorous provisions, some difficulties for 3-factor COMQ-12 sub-scores, particularly ‘Hearing’, cannot be brushed off. We would encourage further criterion validation for COMQ-12, but the present evidence for ‘Ear discharge symptoms’ and ‘Hearing’ is strong enough to recommend first adding items to the former and restricting the latter to hearing disability items. The other items presently accommodated there (dizziness and tinnitus) could be represented either via bi-factor general or in some other homogeneous subset yet to be defined, but those make the present ‘Hearing’ sub-score from EFA or simple CFA an uneasy mixture. For all PROM development, we would advocate explicitly reporting the trade-offs between interpretation, fit and parsimony of alternative models, and the use of demonstrably near-optimal models to assess the various forms of validity. Content validity from an interpretable factor solution is just a preliminary stage.
The COMQ-12 item set in adult chronic otitis media is better modelled with 3 factors than with 2 or 4; all 3-factor solutions examined gave reasonable interpretability, but only a bi-factor solution gave a high standard of fit. The total COMQ-12 score showed acceptable convergent criterion validity for general clinical use as an impact summary, but not sufficient divergent validity for research purposes. None of the 3 sub-scores achieved a satisfactory balance of convergent with divergent validity for either the simple or the bi-factor solution. Too-limited item support, of which there was separate evidence, and non-specific response biases seem to be the main explanations. Potential PROM users need cautions over item support and other statistical constraints (eg about further forms of validity still needing to be demonstrated). Qualitative confirmation of overall content validity by an interpretable solution is a necessary start, but demonstration of convergent and divergent criterion validities is required for any sub-score profiling advocated.

Incorporation of replicate data from Phase 1 study and the use of dual-visit baseline.

Scoring and handling of missing data.

Definition of an appropriate binaural average hearing level (aPTA) for asymmetric hearing loss.

Composition of total SF-36 score for present purposes.

Minimum data set for present analysis.

7 Jun 2022

PONE-D-22-07528

Can short PROMs support valid factor-based sub-scores? Example of COMQ-12 in chronic otitis media

PLOS ONE

Dear Dr. Bukurov,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please pay attention to the comments raised about statistical analysis and respond appropriately.

Please submit your revised manuscript by Jul 22 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file. Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,
Claudio Andaloro
Academic Editor
PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2.
In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability. Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized. Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access. We will update your Data Availability statement to reflect the information you provide in your cover letter. 3. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. 
If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide. 4. Please include your full ethics statement in the ‘Methods’ section of your manuscript file. In your statement, please include the full name of the IRB or ethics committee who approved or waived your study, as well as whether or not you obtained informed written or verbal consent. If consent was waived for your study, please include this information in your statement as well. 5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice. [Note: HTML markup is below. Please do not edit.] Reviewers' comments: Reviewer's Responses to Questions Comments to the Author 1. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 2. Has the statistical analysis been performed appropriately and rigorously? Reviewer #1: I Don't Know Reviewer #2: Yes Reviewer #3: No ********** 3. Have the authors made all data underlying the findings in their manuscript fully available? 
The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified. Reviewer #1: No Reviewer #2: Yes Reviewer #3: No ********** 4. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here. Reviewer #1: Yes Reviewer #2: Yes Reviewer #3: Yes ********** 5. Review Comments to the Author Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters) Reviewer #1: Dear authors, I would like to congratulate you for putting together such an interesting study. Validation of questionnaires, especially with regards for clinical relevance, are needed. I have a few comments: Introduction: "Nowadays, surgery is mostly elective in COM (to reduce hearing disability) and rarely obligatory (to prevent or arrest complications)" --> I fully disagree with this statement. This has no scientific basis whatsoever. 
There is abundant evidence that intracranial complications of chronic otitis media, especially suppurative / cholesteatomatous are a significant cause of death. Lacks reference and a better explanation. Introduction: "Balanced decision for surgery versus conservative management requires a multi-aspect appraisal to be communicated to the patient for shared understanding of realistic treatment gains in COM’s main impacts, and in generic HRQoL" - Again - although shared decision has a role in uncomplicated cases, that is not always the case. Again lacks references. Methods, "participants and data acquired" - I assume these patients had unilateral COM, is that correct? Must be stated. Reviewer #2: This is a well-written and technically sound paper that has explored the usefulness of patient-reported outcome measure through a questionnaire COMQ-12. The statistical methodology holds rigor and adds to its methodology. Reviewer #3: This appeared to be a primarily straightforward manuscript (wrt. statistical analysis and data presentation). The material presented appeared insightful. I have some minor questions: 1. A sample size justification was provided, based on r = 0.40. However, (a) it is not clear whether it was based on the "primary response" variable (which should always be the case when presenting sample size/power statements, (b) the name of the statistical test was missing, (c) whether it was a 2-sided test that was used, and (d) no idea why an alpha = 0.001 was considered! 2. The statistical strategy section should also state what needs to be done when normality of model residuals were violated. Explain clearly, if continuous data, or discrete data is the focus. If continuous, assumptions of multivariate normality is immediate. Did the authors check whether that was satisfied, during actual data analysis? 3. 
Looks like data were generated for multiple time-points/repeated measures; so, why was some (repeated-measures) ANOVA-type approach not conducted, in addition? Some mixed-model approach would also have been appropriate; why was that not done? I was looking for some clarification. ********** 6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy. Reviewer #1: No Reviewer #2: No Reviewer #3: No ********** [NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.] While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

29 Jul 2022 Response to reviewers’ views We thank the reviewers for their appreciation of the importance of the scientific issues addressed in our manuscript on COMQ-12, and for offering suggestions which definitely improve it. Below is our point-by-point response to each of the main comments. All changes that we have made from the original manuscript text are also given in the Tracked Changes (TC) version. Page- & line-pointers added here refer to the changes in the TCs- accepted version of the manuscript. Changes include usual general polishing for greater clarity and some cutting for wordcount. The latter stems partly from needing to make additions to address reviewers’ points, and this applies also to Supplementary information Appendix S2, the only Appendix in which the changes are substantial, but those are partly due also to the requirements of the Minimum Data Set. This minimum data set for present analysis according to PLOS instructions is uploaded now as Supplementary Information S5 Appendix and has an introduction in S2, naturally related to the issues explaining the general data structure and nature of certain variables. There are no new introductions of substantive issues beyond those formerly present or raised by the reviewers. However, a single sentence clarifies the somewhat obvious, that the bi-factor general factor approximates simple 1st PC from EFA by giving their high correlation coefficient (> 0.90; P 12 line 297); this might perhaps be thought a substantive point as an r-value is a kind of result, and this was not mentioned before. Reviewer Comments in Italic, Authors’ Responses in Roman font Reviewer #1: Dear authors I would like to congratulate you for putting together such an interesting study. Validation of questionnaires, especially with regards for clinical relevance, are needed. 
I have a few comments: Introduction: "Nowadays, surgery is mostly elective in COM (to reduce hearing disability) and rarely obligatory (to prevent or arrest complications)" --> I fully disagree with this statement. This has no scientific basis whatsoever. There is abundant evidence that intracranial complications of chronic otitis media, especially suppurative / cholesteatomatous are a significant cause of death. Lacks reference and a better explanation. We appreciate the endorsement of the need for proper validation. We had wanted to respect clinical relevance in the choice of criterion measures, and think that this was achieved, as the view seems to imply. We entirely agree that the quoted general sentence in the Introduction was unfortunately sweeping, and poorly worded in our revision of earlier draft, seeking brevity. It is readily improved by a clearer explanation and a reference. We had obviously not wanted to characterize all surgery, including for active disease, as elective. Of course, there is abundant evidence that both active forms of disease, mucosal and squamous, can lead to complications of COM (according to some series, up to 50%)*. But we here wished to convey a dimension of relative urgency of the recommendation for surgery and hope we have now achieved a good wording for that. In active COM, surgery is necessary, of course, but if not acute and progressing, it can be planned quasi-electively for a later date, and rarely is it urgent and obligatory (as, for example, it is with expected early development of extracranial or intracranial complications.) We are sorry for not seeing how our wording could have led to a misunderstanding. The rewritten sentence in the main text (Page 3 lines 51-56) tries to express this very succinctly, as it is a well-rehearsed issue. *Singh B, Maharaj TJ. Radical mastoidectomy: its place in otitic intracranial complications. J Laryngol Otol 1993; 107(12):1113–18. 
Introduction: "Balanced decision for surgery versus conservative management requires a multi-aspect appraisal to be communicated to the patient for shared understanding of realistic treatment gains in COM’s main impacts, and in generic HRQoL" - Again - although shared decision has a role in uncomplicated cases that is not always the case. Again lacks references. This is really a complementary aspect of the preceding point. We agree that shared decision for surgery is particularly appropriate in inactive forms of the disease. The sentence has now been modified and a relevant reference added; we think the intended dimension is more clearly conveyed now. Methods, "participants and data acquired" - I assume these patients had unilateral COM, is that correct? Must be stated. We appreciate that the reader might expect to see some mention of laterality within the general sample descriptions in Methods, so we have added that information on laterality was collected. The numbers of unilateral or bilateral disease of our patients are given in the Results section, in footnotes of Table 1. We had not dwelled on the categorical terms ‘unilateral/bi-lateral’, as this clinical shorthand description loses the more important information in the distributions of hearing levels (or other properties) on the two ears. Reviewer #2: This is a well-written and technically sound paper that has explored the usefulness of patient-reported outcome measure through a questionnaire COMQ-12. The statistical methodology holds rigor and adds to its methodology. We appreciate this awareness of the careful reflection and effort that went into this work. We did our best to make a rigorous, balanced and not un-critical evaluation of the scientific worth and applicability of COMQ-12, emphasising the sometimes overlooked psychometric development issues. Reviewer #3: This appeared to be a primarily straightforward manuscript (wrt. statistical analysis and data presentation). 
The material presented appeared insightful. I have some minor questions: 1. A sample size justification was provided, based on r = 0.40. However, (a) it is not clear whether it was based on the "primary response" variable (which should always be the case when presenting sample size/power statements), (b) the name of the statistical test was missing, (c) whether it was a 2-sided test that was used, and (d) no idea why an alpha = 0.001 was considered!

We take the generally positive introduction from this reviewer in the spirit apparently intended, and we reciprocate. However, in explaining why we formerly wrote as we did, or did not write, we cannot completely avoid commenting on some assumptions behind the points in the queries. Some of these issues are not so small as the mere requests for more detail in reporting Methods imply. We first try to answer 1 (a)-(d) as briefly, simply and factually as possible, and later comment on how two of (a)-(d) are directed at other targets than those in the article.

(a) The type of power statement given is of generalised statistical power for an effect size (for which the correlation coefficient is the widely familiar example in an association). It is not an a priori sample size calculation grounded in a particular measure on a particular score. There is not space here to go into the mistaken view sometimes expressed that no other form of calculation than the latter should be done. In the past era of clinical trials, there was typically one dominating overall trial question (“is this treatment effective?”), for leading into policy decisions; this question was assumed to be most simply, if not entirely adequately, answerable by straightforward ‘significance’ of a single difference on a single measure. That principle became crystallised as a good-practice rule to help head off cherry-picking or confusion in reporting and applying trial results.
The rule of a single predominant outcome measure (what we have taken ‘primary response’ to mean here) is in general long gone, but it remains true that clarifying relative importance of few study questions assists interpretation, as does pre-declaring the effect size magnitudes justifying conclusions. These things we have done. We had not intended to imply an a priori sample-size calculation, which is not called for in the study, as we clarify below, but to summarise and communicate a power scenario, and the modified wording emphasizes this more strongly. This study is not a clinical trial but a primarily correlational, psychometric study about a related set of measures from an item pool, where the concept of most important outcome measure does not have any obvious direct counterpart. No ‘response’ (ie mean pre-/post-treatment difference on an outcome measure after treatment) is reported here. The plurality of measures involved is so essential to the study question (supportability of multi-score profiling) that it figures in our title. We do, somewhat conventionally use the overall score (ie principal component, PC, as weighted total) as conceptual pivot and procedural step, we do conclude with recommending its use as being empirically supported for simple reasons of generality and reliability, and that recommendation might lead to use of COMQ-12 becoming concentrated only on this total. However, to have translated this generality into policy-relevant importance for a sample-size calculation based on a marginal p-value such as p=0.05, would have adversely restricted N for the whole set of issues in the psychometric study; it would have under-powered the study for the issue(s) actually at stake. The participant sampling adequacy is safe at N=246 and helps compensate for the higher variability (hence lower power) with shorter sub-scores. 
The explicit emphasis throughout is on the numbers of items per measure, which, counteracting some neglect in the PROMs literature, we here show is barely sufficient for the COMQ-12 under evaluation. This chief issue ‘Is there support-for-profiling with sub-scores?’ was prominently set out in the main text, but we have re-emphasised it in several places, chiefly in the strategy section P 5 lines 121-128 to head off any misunderstanding by readers. (b) The simple statistical test for non-nullity of correlations (Fisher, a transformation of r to Z) is in widespread use and built into many statistical packages. We used the SPSS syntax, which is covered by the general SPSS declaration in the text; this declaration has become customary as default citation, and we did not consider it necessary to burden the bibliography with a reference to this almost universally used eponymous test. Nonetheless, we have now introduced small changes in wording to both text and table 3 footnotes to capture this. Extensions of Fisher’s Z are available for differences in correlation value, which is a less familiar analysis meriting more specification. We restrict that type of test to the quantification of divergent validity, largely covered in the footnote to Table 3, but have added there a background theoretical reference also for the issues in correlation differences. (c) The short answer is: ‘necessarily, 2-tailed’. Even when some statistical authorities used to recommend conditions under which a single-sided test could be used, there were painful disputes over defining what those should be. The distinction introduces incentives to make results look more favorable than they were by producing diverse arguments for analyses meeting criteria for using 1-tailed tests. Many Bayesians regard the concept as meaningless, and that also has contributed to the use of 1-tail tests becoming less common than formerly. 
For these reasons, it seemed redundant to declare that all tests were 2-tailed, but we have now done this at our first given p-value, located at the power scenario referred to above. (d) There is a wealth of literature (suggested search-term ‘Replication Crisis’) from the last 15 years complaining much more vocally than in the previous 50 of inadequate statistical standards in biomedical and social science. This literature agrees that whatever the solution, p-values of 0.05 represent a generally low standard of evidence. In the light of this clean-up, any scientist is justified in aiming higher. There are circumstances where an a priori intention for alpha 0.001 could be appropriate, eg where high certainty is required. However, as implied under (a) above, our descriptive power scenario merely used this as one power parameter traded off against the other parameters in a scenario. We kept the values chosen within familiar ranges, so as to communicate the actually high power to readers who might not be thoroughly versed in the purpose and processes of power calculation. Throughout, we have allocated highest scientific priority to what should be considered useful effect magnitudes in respect of correlation coefficients, a question which is not N-dependent (apart from minor small-sample formula adjustments such as possible use of Williams t in preference to Fisher’s Z). Most importantly, as is present also in a priori sample size calculations, we made an explicit declaration of a set of value (s) pivotal for drawing conclusions. Those values have not been questioned by the reviewers and are transparent for debate. The power (ie against r=0) for the declared target absolute correlation values is extremely high and we do not see an alternative approach to such use of a set of calculation parameter values for communicating this fact which we felt it important for overall interpretation to report. 
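The power scenario discussed under (a)-(d) above (r = 0.40, two-tailed alpha = 0.001, N = 246) can be checked approximately with the Fisher Z transformation. The following is a minimal sketch, assuming the usual large-sample normal approximation with standard error 1/sqrt(N - 3); the function name is hypothetical and this is not the SPSS procedure referred to in the text.

```python
import math
from statistics import NormalDist


def fisher_z_power(r, n, alpha=0.001):
    """Approximate two-tailed power to reject H0: rho = 0 when rho = r.

    Uses Fisher's transformation z = atanh(r), treated as normal with
    standard error 1 / sqrt(n - 3).
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # two-tailed critical value
    shift = math.atanh(r) * math.sqrt(n - 3)    # expected test statistic
    # Probability of exceeding the critical value in either tail.
    return nd.cdf(shift - z_crit) + nd.cdf(-shift - z_crit)


# Scenario values taken from the text: r = 0.40, alpha = 0.001, N = 246.
print(f"power = {fisher_z_power(0.40, 246, alpha=0.001):.4f}")

# The same effect size at the conventional alpha, for comparison.
print(f"power = {fisher_z_power(0.40, 246, alpha=0.05):.4f}")
```

Under these assumptions the computed power is very close to 1, consistent with the statement that the power against r = 0 for the declared target correlations is extremely high.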
Furthermore, in the final table we report the procedures appropriate for statements about differences in correlation magnitude for divergent validity, although our main conclusions do not draw heavily on these. There is a prevalent bad habit (see for example Nieuwenhuis et al 2011) for authors to proceed, eg from presence versus absence of a ‘significant’ effect, to make difference-based or interaction statements without the appropriate statistical test, one for which their studies are mostly grossly underpowered. In the event, 5 of the 6 correlation differences in Table 3 are respectably significant and one marginal. The exact power for these differences is not a major issue. We had not declared that divergent validity, hence correlation differences, was our sole concern, but a supplementary consideration to convergent validity, so we do not deploy a second power scenario. More importantly, to head off cherry-picking of conclusions, we state that the magnitudes of the differences, although mostly conventionally significant by virtue of adequate sample size, are actually quite small. Accordingly, we conclude in Results and Conclusions only that some divergent validity is shown by the bi-factor solution, but that it is modest and inevitably detracts from convergent validity as measured. The implication is that other instruments, and especially those constrained to be very brief for clinical reasons, need to be based on similarly rigorous and dispassionate validations. We imply that mostly they are not so, but the thrust of the article is not overtly or polemically critical.

(2) The statistical strategy section should also state what needs to be done when normality of model residuals were violated. Explain clearly, if continuous data, or discrete data is the focus. If continuous, assumptions of multivariate normality is immediate. Did the authors check whether that was satisfied, during actual data analysis?
There is indeed some value in an explicit statement of measurement type, so we have simply inserted the words ‘for continuous measurement’ at one point. Our formerly omitting it is explained by its being the ‘un-marked’ common default instance in psychometrics, with categorical being the type requiring the explanation as inevitable etc; the Pearson correlations and conventional CFA require that measurement be continuous, and such omission where redundant is common in scientific writing. The account of item scaling to enhance the equal-interval properties at item level, referenced, and taking up much of Supplementary Information S2 Appendix, addresses the fullest justification of continuous measurement, and the declaration about that need not be labored in main text. With categorical measurement the concept would not arise. Concerning multivariate normality (MVN), we had of course satisfied ourselves that the distributions were generally appropriate for the modelling. This was done in four now traditional ways to assure that distribution issues did not undermine any conclusions drawn: (a) graphic item raw distributions, (b) plotting residuals from component multiple regressions with single factor components of the overall model (which are at least true residuals), (c) habitually bootstrapping (BS), and also reporting the BS p-values for any normal-assumption p-values that are >0.01, or in appropriate instances for those that are in a magnitude range pre-defined as marginal; and (d) noting pairs of variables (items) with high absolute residuals, although this last is more about degrading potentially good fit (false-negative), rather than resisting spuriously good fit (false positive). We had not thought it appropriate to quote multivariate normality statistics, and the necessarily long following reasoning shows why not. Normality bears chiefly on the literal interpretation of p-values, something best played down anyway since the reforms of the 1980s and the 2010s. 
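The bootstrapping check listed under (c) above can be illustrated for a single correlation. The following is a minimal sketch, assuming a simple case-resampling percentile bootstrap; it reports a confidence interval rather than the BS p-values mentioned in the text, the data are synthetic, and the function names are hypothetical. It is not the SPSS-AMOS procedure used in the study.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for r, resampling cases with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


# Synthetic illustration only: 246 cases with a moderate true association.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(246)]
y = [0.5 * a + data_rng.gauss(0, 1) for a in x]

r = pearson_r(x, y)
lo, hi = bootstrap_ci(x, y)
print(f"r = {r:.3f}, 95% percentile bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

Because the interval is built from the resampled distribution itself, it does not rely on normality of the variables, which is the reason for reporting bootstrap results alongside normal-assumption p-values.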
There are other important issues beyond apparent fit in model acceptance, such as the df-ratio (which only the Akaike Information Criterion, by giving delta = (default minus saturated), well expresses; we use that, but there was not space to emphasise specifically why), and also whether spending df on estimating intercepts and means was done to help handle missing data. We did not specifically report that decision or its reasoning, nor specify the ML estimator employed (again a somewhat default matter), and so on. With so many detailed options, many not differing materially for a given set of data, it is not always clear where one should stop in reporting procedural detail, and investigators with differing recent experiences would probably differ in their preferences. Nevertheless, we are very glad to have received this question on MVN as a stimulus to formalize our approach. Immediately, we state the relevant action in the added paragraph to be found at the boundary between pages 9 and 10, lines 208-218. Some consequential minor refinements have contributed to the changes made to the Supplementary Information S2 Appendix. We think the reviewer is chiefly concerned about the explicit reporting of techniques used, so have tried to keep the following explanation brief. In short, simply stating whether MVN assumptions are ‘satisfied’ via cut-offs on two parameters (skewnesses and kurtoses, typically reduced to one p-value) could potentially be misleading by generalising false beliefs about the relevant data, these errors being of either false-positive or false-negative type. There is a need for systematic steps when multivariate normality is not safely met, but also if it is marginal.
The steps should be ‘strategically’ guided and preferably pre-declared, but as they also have to respond to the obtained data, that ideal is not easily met; their reporting does not sit easily under the introductory ‘strategy’ or other sub-section of Methods, but requires explicit sections on preliminary analyses. This broader approach is required because of technical issues over MVN itself: the lack of consistent guidance on what value of the Mardia index is to be taken as safe, marginal or dangerous; software facilities that use or give access to raw variables rather than true residuals; and possible false assurance, in that an overall index may conceal (‘false negative’) local violations of normality bearing upon the interpretation of local features.

In the light of the complexity of appropriate use of MVN information, when a colleague familiar with the package R-Lavaan returned to work recently, we asked her to implement our simple and bi-factor CFAs. This was a further check on our declared approach. To summarise: (a) the results are reproduced for both models on RMSEA and CFI indices to within 5 in the 3rd decimal place (that is, one half of one percent), and similarly on the p-values, particularly for the marginal links in the Hearing factor meriting this close attention; (b) Lavaan confirmed that the bi-factor model, whilst fitting much better, is also the more kurtotic; (c) in neither SPSS-AMOS nor R-Lavaan does the offered MVN index describe the residuals, but rather the raw variables; (d) in neither package is it easy to access the residuals for doing one’s own normality analyses, multi- or uni-variate. Shortly before our resubmission date she showed that residuals can be saved for further analysis, and so accessed, through the programming flexibility in R-Lavaan, but good guidance notes on the need and the process need to be provided. (e) On doing this, the MV skew became stronger than for raw data in the present dataset, but this is of less concern; the Mardia kurtosis, which was highly significant in the raw values, became non-significant (NS; p = 0.19) on the residuals, though still positive (1.3).

At time of resubmission these comparisons are ongoing with a view to the statistical applications note mentioned, so we have not felt it appropriate to modify the article to claim the virtue of multi-versed checks by using more than one package. The challenge posed by this complexity, and by the need for authors to achieve economy in reporting, is to produce just a few comprehensible sentences that remain meaningful in the light of the complex reality. Our offering on pages 9-10 may still appear somewhat cryptic for the general reader, as much of the scientific reasoning behind what is appropriate has had to be excluded for brevity. However, we trust that it satisfies the reviewer’s emphasis on the need for comprehensive reporting; those of high statistical literacy might be in a position to judge the appropriateness of our approach, i.e. concentrating various types of checking on where genuine issues of marginality for link retention might compromise conclusions.

3. Looks like data were generated for multiple time-points/repeated measures; so, why was some (repeated-measures) ANOVA-type approach not conducted, in addition? Some mixed-model approach would also have been appropriate; why was that not done? I was looking for some clarification.

The short answer is: not every aspect of a study’s data structure need appear in analyses in a single article, and the article is already full. Indeed, it encourages general transparency to (briefly) declare other related data that are held but not analysed in the current paper because they are not related to the scientific question, and this we did. We have now added a few further words of clarification to this effect.
An article on the time/treatment aspect was submitted at around the same time to another journal and is now accepted there. The reason we have not used repeated measures or other versions of ANOVA is that those techniques are for comparing mean effects (e.g. treatment) over time against background variability (individual and other) to assess the reliability and magnitude of change. The present article is primarily psychometric, so is not about such mean comparisons, shifts or effects, as in the later treatment aspect, but about factor structure and appropriate scoring, using the more copious baseline data, and primarily the correlations between scores and available criterion measures. Later-visit data could, at some cost in complexity, also be used in development of measures (e.g. if manifest change led to concern that the measurement model should optimally span the periods from which measurements would be used); however, it is conventional not to do that, but rather to accept the structural account from baseline as defining the psychometrics. This is done because of sample attrition: the first visit usually has the largest N of all the time-points and so best guarantees reliability, generality and the crucial stationarity of measurement, if perhaps not its optimality for capturing change. We share the enthusiasm for fitting categorical independent variables, where sampled, as random effects, e.g. by using mixed models; however, the purpose and design here do not call for random-effects analyses either.

We trust that these changes and the general clarifications of wording around them give the paper appeal to the readership in the balanced way at which we must all aim.

Submitted filename: Response to reviewers.docx

30 Aug 2022

Can short PROMs support valid factor-based sub-scores? Example of COMQ-12 in chronic otitis media

PONE-D-22-07528R1

Dear Dr.
Bukurov,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,
Claudio Andaloro
Academic Editor
PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed
Reviewer #2: All comments have been addressed
Reviewer #3: All comments have been addressed

2. Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: (No Response)

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: (No Response)

4. Have the authors made all data underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: (No Response)

5. Is the manuscript presented in an intelligible fashion and written in standard English? PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes
Reviewer #2: Yes
Reviewer #3: (No Response)

6. Review Comments to the Author. Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thanks for the clarifications. I have no further concerns. Congratulations for the authors for the manuscript.
Reviewer #2: The revisions carried out are satisfactory and have suitably addressed all the queries raised during the review process.
Reviewer #3: (No Response)

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files. If you choose “no”, your identity will remain anonymous but your review may still be made public. Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Rafael da Costa Monsanto
Reviewer #2: Yes: Mainak Dutta
Reviewer #3: No

20 Sep 2022

PONE-D-22-07528R1

Can short PROMs support valid factor-based sub-scores? Example of COMQ-12 in chronic otitis media

Dear Dr. Bukurov:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,
PLOS ONE Editorial Office Staff
on behalf of
Dr. Claudio Andaloro
Academic Editor
PLOS ONE
