Lidwine B Mokkink, Caroline B Terwee, Elizabeth Gibbons, Paul W Stratford, Jordi Alonso, Donald L Patrick, Dirk L Knol, Lex M Bouter, Henrica C W de Vet.
Abstract
BACKGROUND: The COSMIN checklist is a tool for evaluating the methodological quality of studies on measurement properties of health-related patient-reported outcomes. The aim of this study is to determine the inter-rater agreement and reliability of each item score of the COSMIN checklist (n = 114).
Year: 2010 PMID: 20860789 PMCID: PMC2957386 DOI: 10.1186/1471-2288-10-82
Source DB: PubMed Journal: BMC Med Res Methodol ISSN: 1471-2288 Impact factor: 4.615
Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) on whether the property was evaluated in an article (COSMIN step 1)
| Measurement property | Percentage agreement | Intraclass kappaa |
|---|---|---|
| Internal consistency | | 0.66 |
| Reliability | | |
| Measurement error | | 0.02b |
| Content validity | | 0.29 |
| Structural validity | | 0.48 |
| Hypotheses testing | | 0.29 |
| Cross-cultural validity | | 0.66b |
| Criterion validity | | 0.23b |
| Responsiveness | | |
| Interpretability | | 0.02b |
a Number of ratings on the 75 articles = 263; b items with low dispersal, i.e. more than 75% of the raters who responded to an item chose the same response category. Bold indicates kappa > 0.70 or % agreement > 80%.
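The two descriptive quantities named in the footnote can be illustrated with a minimal Python sketch. It rests on assumptions that the source does not spell out: percentage agreement is taken here as the share of articles (with at least two ratings) on which all raters chose the same category, and low dispersal is flagged when more than 75% of all ratings fall in one category. The function names and the example data are hypothetical.

```python
from collections import Counter


def percentage_agreement(ratings_by_article):
    """Percent of articles with >= 2 ratings on which all raters agree.

    Assumed definition for illustration; articles rated only once are skipped.
    """
    multi = [r for r in ratings_by_article.values() if len(r) >= 2]
    if not multi:
        return None  # nothing to compare
    agreed = sum(1 for r in multi if len(set(r)) == 1)
    return 100.0 * agreed / len(multi)


def has_low_dispersal(ratings_by_article, threshold=0.75):
    """True if more than `threshold` of all ratings share one response category."""
    all_ratings = [x for r in ratings_by_article.values() for x in r]
    if not all_ratings:
        return False
    most_common_count = Counter(all_ratings).most_common(1)[0][1]
    return most_common_count / len(all_ratings) > threshold


# Hypothetical ratings of one checklist item across four articles.
example = {
    "article_1": ["yes", "yes", "yes"],
    "article_2": ["yes", "no"],
    "article_3": ["yes"],  # single rating: excluded from percentage agreement
    "article_4": ["yes", "yes"],
}

print(percentage_agreement(example))
print(has_low_dispersal(example))
```

In this toy data set most ratings fall in one category, which is the situation in which agreement can be high while kappa stays low.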
Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 3)
| Item nr | Item | N (minus articles with 1 rating)a | % agreement | N | Kappa |
|---|---|---|---|---|---|
| A1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 185 | | 193 | 0.06 |
| Design requirements | |||||
| A2c | Was the percentage of missing items given? | 183 | | 190 | 0.48 |
| A3c | Was there a description of how missing items were handled? | 180 | | 187 | 0.54 |
| A4 | Was the sample size included in the internal consistency analysis adequate? | 177 | | 185 | 0.06d |
| A5c | Was the unidimensionality of the scale checked? i.e. was factor analysis or IRT model applied? | 180 | | 187 | 0.69 |
| A6 | Was the sample size included in the unidimensionality analysis adequate? | 166 | 79 | 178 | 0.27 |
| A7 | Was an internal consistency statistic calculated for each (unidimensional) (sub)scale separately? | 179 | | 187 | 0.31d |
| A8c | Were there any important flaws in the design or methods of the study? | 174 | | 179 | 0.22d |
| Statistical methods | |||||
| A9 | for Classical Test Theory (CTT): Was Cronbach's alpha calculated? | 179 | | 187 | 0.27d,e |
| A10 | for dichotomous scores: Was Cronbach's alpha or KR-20 calculated? | 151 | | 165 | 0.17d,e |
| A11 | for IRT: Was a goodness of fit statistic at a global level calculated? e.g. χ2, reliability coefficient of estimated latent trait value (index of (subject or item) separation) | 154 | | 167 | 0.46d,e |
| Design requirements | |||||
| B1c | Was the percentage of missing items given? | 129 | | 140 | 0.39 |
| B2c | Was there a description of how missing items were handled? | 125 | | 137 | 0.43d |
| B3 | Was the sample size included in the analysis adequate? | 127 | 77 | 139 | 0.35 |
| B4c | Were at least two measurements available? | 129 | | 140 | |
| B5 | Were the administrations independent? | 129 | 73 | 139 | 0.18 |
| B6c | Was the time interval stated? | 125 | | 136 | 0.50d |
| B7 | Were patients stable in the interim period on the construct to be measured? | 126 | 75 | 138 | 0.24 |
| B8 | Was the time interval appropriate? | 125 | | 137 | 0.45 |
| B9 | Were the test conditions similar for both measurements? e.g. type of administration, environment, instructions | 127 | | 138 | 0.30 |
| B10c | Were there any important flaws in the design or methods of the study? | 117 | 77 | 129 | 0.08 |
| Statistical methods | |||||
| B11 | for continuous scores: Was an intraclass correlation coefficient (ICC) calculated? | 119 | | 133 | 0.59e |
| B12 | for dichotomous/nominal/ordinal scores: Was kappa calculated? | 111 | | 127 | 0.32e |
| B13 | for ordinal scores: Was a weighted kappa calculated? | 111 | | 127 | 0.42e |
| B14 | for ordinal scores: Was the weighting scheme described? e.g. linear, quadratic | 108 | | 124 | 0.35e |
| Design requirements | |||||
| D1 | Was there an assessment of whether all items refer to relevant aspects of the construct to be measured? | 62 | 79 | 83 | 0.33 |
| D2 | Was there an assessment of whether all items are relevant for the study population? (e.g. age, gender, disease characteristics, country, setting) | 62 | 76 | 83 | 0.46 |
| D3 | Was there an assessment of whether all items are relevant for the purpose of the measurement instrument? (discriminative, evaluative, and/or predictive) | 62 | 66 | 83 | 0.21 |
| D4 | Was there an assessment of whether all items together comprehensively reflect the construct to be measured? | 62 | 66 | 83 | 0.15 |
| D5c | Were there any important flaws in the design or methods of the study? | 58 | 76 | 78 | 0.13 |
| E1 | Does the scale consist of effect indicators, i.e. is it based on a reflective model? | 99 | 78 | 116 | 0f |
| Design requirements | |||||
| E2c | Was the percentage of missing items given? | 95 | | 110 | 0.41 |
| E3c | Was there a description of how missing items were handled? | 93 | | 109 | 0.55 |
| E4 | Was the sample size included in the analysis adequate? | 94 | | 109 | 0.56d |
| E5c | Were there any important flaws in the design or methods of the study? | 89 | | 103 | 0.27 |
| Statistical methods | |||||
| E6 | for CTT: Was exploratory or confirmatory factor analysis performed? | 92 | | 106 | 0.51d,e |
| E7 | for IRT: Were IRT tests for determining the (uni-) dimensionality of the items performed? | 62 | | 80 | 0.39e,f |
| Design requirements | |||||
| F1c | Was the percentage of missing items given? | 158 | | 168 | 0.41 |
| F2c | Was there a description of how missing items were handled? | 159 | | 169 | 0.60d |
| F3 | Was the sample size included in the analysis adequate? | 157 | | 167 | 0.12d |
| F4 | Were hypotheses regarding correlations or mean differences formulated a priori (i.e. before data collection)? | 158 | 74 | 168 | 0.42 |
| F5 | Was the expected direction of correlations or mean differences included in the hypotheses? | 159 | 75 | 169 | 0.26e |
| F6 | Was the expected absolute or relative magnitude of correlations or mean differences included in the hypotheses? | 159 | | 168 | 0.29e |
| F7c | for convergent validity: Was an adequate description provided of the comparator instrument(s)? | 125 | | 136 | 0.30 |
| F8c | for convergent validity: Were the measurement properties of the comparator instrument(s) adequately described? | 124 | | 135 | 0.35 |
| F9c | Were there any important flaws in the design or methods of the study? | 131 | | 145 | 0.17 |
| Statistical methods | |||||
| F10 | Were design and statistical methods adequate for the hypotheses to be tested? | 150 | 78 | 161 | 0.00d,e,f |
| Design requirements | |||||
| G1c | Was the percentage of missing items given? | 25 | | 32 | 0.52 |
| G2c | Was there a description of how missing items were handled? | 22 | | 30 | 0.32 |
| G3 | Was the sample size included in the analysis adequate? | 26 | | 33 | 0.23 |
| G4c | Were both the original language in which the HR-PRO instrument was developed, and the language in which the HR-PRO instrument was translated described? | 28 | | 33 | 0.34d |
| G5c | Was the expertise of the people involved in the translation process adequately described? e.g. expertise in the disease(s) involved, expertise in the construct to be measured, expertise in both languages | 28 | | 33 | 0.46 |
| G6 | Did the translators work independently from each other? | 28 | | 33 | 0.61 |
| G7 | Were items translated forward and backward? | 28 | | 33 | |
| G8c | Was there an adequate description of how differences between the original and translated versions were resolved? | 28 | | 33 | 0.50 |
| G9c | Was the translation reviewed by a committee (e.g. original developers)? | 25 | | 31 | 0.56 |
| G10c | Was the HR-PRO instrument pre-tested (e.g. cognitive interviews) to check interpretation, cultural relevance of the translation, and ease of comprehension? | 21 | | 29 | 0.61 |
| G11c | Was the sample used in the pre-test adequately described? | 28 | 79 | 32 | 0f |
| G12 | Were the samples similar for all characteristics except language and/or cultural background? | 26 | | 31 | 0.41 |
| G13c | Were there any important flaws in the design or methods of the study? | 26 | | 31 | 0.42 |
| Statistical methods | |||||
| G14 | for CTT: Was confirmatory factor analysis performed? | 27 | 74 | 32 | 0.03e,f |
| G15 | for IRT: Was differential item function (DIF) between language groups assessed? | 13 | 77 | 23 | 0.28e,f |
| Design requirements | |||||
| H1c | Was the percentage of missing items given? | 35 | | 56 | 0.59d |
| H2c | Was there a description of how missing items were handled? | 35 | | 56 | |
| H3 | Was the sample size included in the analysis adequate? | 35 | 69 | 54 | 0.06 |
| H4 | Can the criterion used or employed be considered as a reasonable 'gold standard'? | 37 | 62 | 57 | 0f |
| H5c | Were there any important flaws in the design or methods of the study? | 33 | 79 | 54 | 0.10 |
| Statistical methods | |||||
| H6 | for continuous scores: Were correlations, or the area under the receiver operating curve calculated? | 37 | 78 | 56 | 0.16e |
| H7 | for dichotomous scores: Were sensitivity and specificity determined? | 29 | | 47 | 0.28e,f |
| Design requirements | |||||
| I1c | Was the percentage of missing items given? | 71 | | 76 | 0.14d |
| I2c | Was there a description of how missing items were handled? | 73 | | 77 | 0.36d |
| I3 | Was the sample size included in the analysis adequate? | 72 | 72 | 76 | 0.40 |
| I4c | Was a longitudinal design with at least two measurements used? | 73 | | 78 | |
| I5c | Was the time interval stated? | 73 | | 78 | 0.25d |
| I6c | If anything occurred in the interim period (e.g. intervention, other relevant events), was it adequately described? | 72 | 78 | 75 | 0.17 |
| I7c | Was a proportion of the patients changed (i.e. improvement or deterioration)? | 70 | | 73 | 0.32d |
| Design requirements for hypotheses testing | |||||
| For constructs for which a gold standard was not available | |||||
| I8 | Were hypotheses about changes in scores formulated a priori (i.e. before data collection)? | 65 | 69 | 72 | 0.35 |
| I9 | Was the expected direction of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 60 | 78 | 65 | 0.19e |
| I10 | Was the expected absolute or relative magnitude of correlations or mean differences of the change scores of HR-PRO instruments included in these hypotheses? | 61 | | 66 | 0.05d,e |
| I11c | Was an adequate description provided of the comparator instrument(s)? | 56 | 70 | 63 | 0f |
| I12c | Were the measurement properties of the comparator instrument(s) adequately described? | 56 | | 63 | 0.06 |
| I13c | Were there any important flaws in the design or methods of the study? | 63 | 71 | 68 | 0.03 |
| Statistical methods | |||||
| I14 | Were design and statistical methods adequate for the hypotheses to be tested? | 63 | 73 | 67 | 0.21e,f |
| Design requirements for comparison to a gold standard | |||||
| For constructs for which a gold standard was available: | |||||
| I15 | Can the criterion for change be considered as a reasonable 'gold standard'? | 21 | 67 | 28 | 0f |
| I16c | Were there any important flaws in the design or methods of the study? | 12 | 67 | 21 | 0f |
| Statistical methods | |||||
| I17 | for continuous scores: Were correlations between change scores, or the area under the Receiver Operator Curve (ROC) curve calculated? | 28 | 79 | 39 | 0.47e,f |
| I18 | for dichotomous scales: Were sensitivity and specificity (changed versus not changed) determined? | 28 | 79 | 37 | 0.15e |
| J1c | Was the percentage of missing items given? | 22 | | 41 | |
| J2c | Was there a description of how missing items were handled? | 21 | 76 | 41 | 0.19 |
| J3 | Was the sample size included in the analysis adequate? | 23 | 74 | 41 | 0f |
| J4c | Was the distribution of the (total) scores in the study sample described? | 23 | 74 | 41 | 0.08 |
| J5c | Was the percentage of the respondents who had the lowest possible (total) score described? | 20 | | 40 | |
| J6c | Was the percentage of the respondents who had the highest possible (total) score described? | 21 | | 41 | |
| J7c | Were scores and change scores (i.e. means and SD) presented for relevant (sub) groups? e.g. for normative groups, subgroups of patients, or the general population | 21 | 76 | 41 | 0.05 |
| J8c | Was the minimal important change (MIC) or the minimal important difference (MID) determined? | 19 | | 40 | 0.26d |
| J9c | Were there any important flaws in the design or methods of the study? | 21 | 71 | 41 | 0f |
a When calculating percentage agreement, articles that were scored only once on the particular item were not taken into account; b number of times a box was evaluated; c dichotomous item; d items with low dispersal, i.e. more than 75% of the raters who responded to an item chose the same response category; e combined kappa coefficient calculated because of the nominal response scale in a one-way design; f negative variance component in the calculation of kappa was set at 0; g sample sizes of the Generalisability box are much higher than those of other items, because scores of the items on the Generalisability box for all measurement properties were combined. Bold indicates kappa > 0.70 or % agreement > 80%.
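Footnote f refers to a variance component that is truncated at zero when kappa is calculated. As a rough illustration only, the Python sketch below estimates an intraclass kappa for a dichotomous item (coded 0/1) as a one-way random-effects intraclass correlation with an unequal number of raters per article, and sets a negative between-article variance estimate to 0. This is a standard textbook estimator chosen for illustration, not necessarily the authors' exact computation. It does not cover the combined kappa for nominal items mentioned in footnote e, and the function name and data are hypothetical.

```python
def intraclass_kappa(scores_by_article):
    """One-way ANOVA estimate of intraclass kappa for 0/1 ratings.

    scores_by_article maps an article id to the list of 0/1 ratings it received.
    A negative between-article variance estimate is set to 0.
    """
    groups = [g for g in scores_by_article.values() if g]
    k = len(groups)                        # number of articles
    n_total = sum(len(g) for g in groups)  # total number of ratings
    if k < 2 or n_total <= k:
        raise ValueError("need >= 2 articles and at least one article rated twice")

    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Effective number of ratings per article for unbalanced data.
    n0 = (n_total - sum(len(g) ** 2 for g in groups) / n_total) / (k - 1)

    var_between = max((ms_between - ms_within) / n0, 0.0)  # truncate at 0
    total = var_between + ms_within
    return var_between / total if total > 0 else float("nan")


# Hypothetical data: raters agree within every article, and articles differ.
print(intraclass_kappa({"a": [1, 1], "b": [0, 0]}))
# Hypothetical data: raters disagree within every article.
print(intraclass_kappa({"a": [1, 0], "b": [1, 0]}))
```

With truncation, an item on which raters disagree as much within articles as across them yields a kappa of 0 rather than a negative value, which matches how the entries marked f are reported.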
Inter-rater agreement (percentage agreement) and reliability (kappa coefficients) of the items from the COSMIN checklist (COSMIN step 4)
| Item nr | Item | N (minus articles with 1 rating)a | % agreement | N | Kappa |
|---|---|---|---|---|---|
| Generalisability box c | |||||
| Was the sample in which the HR-PRO instrument was evaluated adequately described? In terms of: | |||||
| 1d | median or mean age (with standard deviation or range)? | 733 | | 865 | 0.36 |
| 2d | distribution of sex? | 735 | | 863 | 0.38e |
| 3 | important disease characteristics (e.g. severity, status, duration) and description of treatment? | 746 | | 862 | 0.39f |
| 4d | setting(s) in which the study was conducted? e.g. general population, primary care or hospital/rehabilitation care | 735 | | 863 | 0.30e |
| 5d | countries in which the study was conducted? | 733 | | 861 | 0.40e |
| 6d | language in which the HR-PRO instrument was evaluated? | 733 | | 861 | 0.41e |
| 7d | Was the method used to select patients adequately described? e.g. convenience, consecutive, or random | 729 | | 857 | 0.40 |
| 8 | Was the percentage of missing responses (response rate) acceptable? | 724 | | 849 | 0.48 |
a When calculating percentage agreement, articles that were scored only once on the particular item were not taken into account; b number of times a box was evaluated; c sample sizes of the Generalisability box are much higher than those of other items, because scores of the items on the Generalisability box for all measurement properties were combined; d dichotomous item; e items with low dispersal, i.e. more than 75% of the raters who responded to an item chose the same response category; f combined kappa coefficient calculated because of the nominal response scale in a one-way design. Bold indicates kappa > 0.70 or % agreement > 80%.