Measurement Properties of Commonly Used Generic Preference-Based Measures in East and South-East Asia: A Systematic Review.

Xinyu Qian¹, Rachel Lee-Yin Tan¹, Ling-Hsiang Chuang², Nan Luo³.

Abstract

OBJECTIVES: Our aim was to systematically review published evidence on the construct validity, test-retest reliability and responsiveness of generic preference-based measures (PBMs) used in East and South-East Asia.
METHODS: This systematic review was guided by the COSMIN guideline. A literature search on the MEDLINE, EMBASE, PsycINFO and PubMed databases up to August 2019 was conducted for measurement properties validation papers of the EuroQol-5 Dimensions (EQ-5D), Short Form-6 Dimensions (SF-6D), Health Utilities Index (HUI), Quality of Well-Being (QWB), 15-Dimensional (15D) and Assessment of Quality of Life (AQOL) in East and South-East Asian countries. Included papers were disaggregated into individual studies whose results and quality of design were rated separately. The population-specific measurement properties (construct validity, test-retest reliability and responsiveness) of each PBM were assessed separately using relevant studies. The overall methodological quality of the studies used in each of the assessments was also rated.
RESULTS: A total of 79 papers containing 1504 studies were included in this systematic review. The methodological quality was 'very good' or 'adequate' for the majority of the construct validity studies (99%) and responsiveness studies (61%), but for only a small portion of the test-retest reliability studies (23%). EQ-5D was most widely assessed and was found to have 'sufficient' construct validity and responsiveness in many populations, while the SF-6D and EuroQol-Visual Analog Scale (EQ-VAS) exhibited 'inconsistent' construct validity in some populations. Scarce evidence was available on HUI and QWB, but current evidence supported the use of HUI.
CONCLUSIONS: This systematic review provides a summary of the quality of existing generic PBMs in Asian populations. The current evidence supports the use of EQ-5D as the preferred choice when a generic PBM is needed, and continuous testing of all PBMs in the region.

Year: 2020 PMID: 31761995 PMCID: PMC7081654 DOI: 10.1007/s40273-019-00854-w

Source DB: PubMed Journal: Pharmacoeconomics ISSN: 1170-7690 Impact factor: 4.981

Key Points for Decision Makers

Introduction

Preference-based measures (PBMs) provide a convenient approach to deriving health state values for the calculation of quality-adjusted life-years (QALYs) in cost-utility analysis [1]. The use of a PBM starts with describing the health status or health-related quality of life (HRQoL) of individuals using a standardized questionnaire. The HRQoL data can then be converted into health state values using a scoring method (also known as a ‘value set’). Value sets are established using the health preferences of the general public for the health states described by the PBMs. All PBMs use a scale anchored by 0 (corresponding to dead) and 1 (corresponding to full or perfect health), with or without negative values for very poor health states. PBMs are usually developed for use in one population or culture, and subsequently introduced to other populations after translation or cultural adaptation. Since cultural, environmental, and psychosocial factors may affect the performance of PBMs, the measurement properties of PBMs should be validated in all populations and cultures to which they are introduced. Measurement properties that are relevant to all PBMs include construct validity, test-retest reliability, and responsiveness [2, 3].

In psychometrics, construct validity refers to the extent to which a scale measures what it is supposed to measure, test-retest reliability refers to the ability of a scale to generate reproducible measurement results, and responsiveness or sensitivity to change refers to the ability of a scale to capture the change in the levels of the targeted construct [3]. The testing of all three measurement properties involves collecting individual-level data using the scale, and performing statistical analyses. Construct validity is usually assessed through hypothesis testing because of the absence of a ‘gold standard’ measure [3]. Typically, the hypotheses are that a scale should be correlated with another scale measuring a similar construct (i.e. convergent validity) or that measurement results for groups known to differ in certain characteristics should be different (known-groups validity). The more hypotheses that are fulfilled, the more likely it is that a scale is valid [3]. Test-retest reliability is assessed by examining the agreement between two different measurements of the same group of individuals whose levels in the targeted construct are the same at the times of the two measurements. Depending on the nature of the scale, statistics such as the intraclass correlation coefficient (ICC) can be used as the indicator of test-retest reliability. Responsiveness assessment requires longitudinal data collection from individuals whose levels of the targeted construct change over time. Statistics that can be used to indicate responsiveness include standardized effect size (SES), standardized response mean [3], and receiver operating characteristic analysis [4].

Designed for use in a wide range of therapeutic areas, generic PBMs are particularly useful in economic evaluations informing resource allocations. In the past decades, generic PBMs such as EuroQol-5 Dimensions (EQ-5D) [5] and Short Form-6 Dimensions (SF-6D) [6] have been increasingly used in Asian countries and many validation studies assessing their measurement properties in Asian populations have been published. However, the overall performance of PBMs in different countries or patient populations in this region is unknown. This is an important knowledge gap since cost-utility analysis is increasingly used to inform reimbursement decision making in Asia [7, 8]. The aim of this systematic review was to review and summarize the current evidence on the measurement properties of generic PBMs in Asian populations.
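
The two responsiveness statistics named above can be illustrated with a short Python sketch. The review does not spell out the formulae, so the sketch assumes the commonly used definitions: SES as the mean change divided by the standard deviation of the baseline scores, and the standardized response mean as the mean change divided by the standard deviation of the change scores. The function names and the patient scores are hypothetical and serve only as illustration.

```python
from statistics import mean, stdev


def standardized_effect_size(baseline, follow_up):
    """Mean change divided by the SD of the baseline scores (assumed definition)."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """Mean change divided by the SD of the change scores (assumed definition)."""
    changes = [after - before for before, after in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Hypothetical PBM scores for five patients before and after treatment.
baseline = [0.50, 0.60, 0.70, 0.80, 0.90]
follow_up = [0.60, 0.80, 0.70, 0.90, 1.00]

ses = standardized_effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
print(f"SES = {ses:.2f}, SRM = {srm:.2f}")
```

Both statistics share the same numerator; they differ only in whether the spread of baseline scores or the spread of individual changes is used for scaling, which is why they can diverge noticeably in the same sample.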

Methods

The COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) guideline for systematic reviews of outcome measurement instruments [4] was used to guide this review. Unlike systematic review guidelines that are designed to evaluate interventional studies (e.g. the Cochrane guideline), the COSMIN guideline is specialized for evaluating measurement properties, which are usually assessed in observational studies. It provides methods and tools for use in the entire process of a systematic review, including literature search, selection and evaluation of studies, interpretation of results, and reporting of findings. In this review, two members of the review team worked independently through all phases of the review, and discrepancies were resolved via consensus meetings with the other two members of the review team. The four phases of the review process are described below.

Identification and Selection of Studies

The search was carried out using online databases, including MEDLINE (OvidSP), EMBASE (OvidSP), PsycINFO (OvidSP), and PubMed, in August 2019. Three groups of search terms were included to describe: (1) country/district, including countries/districts in South-East and East Asia: ‘China’, ‘Korea’, ‘Japan’, ‘Singapore’, ‘Taiwan’, ‘Hong Kong’, ‘Indonesia’, ‘Malaysia’, ‘Philippine’, ‘Thailand’ and ‘Vietnam’; (2) PBMs of interest, including ‘EQ-5D-3L’, ‘EQ-5D-5L’, ‘EQ-VAS’, ‘SF-6D’, ‘HUI2’, ‘HUI3’, ‘QWB’, ‘15D’, and ‘AQOL’; and (3) measurement properties, including ‘construct validity’, ‘test-retest reliability’ and ‘responsiveness’. All spelling variations, acronyms and related terms were included in the search algorithm (Appendix 1 of Supplementary file). The search filter developed by Terwee et al. [9] for the identification of reports on measurement properties of measurement instruments was adapted for use in this review. Although the EuroQol-Visual Analog Scale (EQ-VAS) is not a PBM, it was included as it is a part of EQ-5D. A set of predefined selection criteria was applied to the hits generated by the search terms. Papers that examined the construct validity, test-retest reliability, and/or responsiveness of any PBM in any country/district of interest were included. Original research using primary data, such as interventional and observational studies, was included. Secondary research, including reviews, was excluded. Reports on mapping, reports published in a non-English language, commentaries, and conference papers (i.e. abstracts) were also excluded.

Data Extraction

The COSMIN guideline differentiates between papers and studies [4]. Each hypothesis tested, ICC, or SES value reported for assessing construct validity, test-retest reliability, and responsiveness, respectively, is treated as one study. Therefore, a paper can include more than one study. Information extracted from each study included PBM, sampling country or district, medical condition of study subjects, sample size, sample mean age, sample sex distribution, language of administration, and study design and result (see the following sections for more detail).

Assessment of Individual Studies

Each study was graded for its result and methodological quality using the methods prescribed in COSMIN [4]. The methods are briefly described below. The result for construct validity was graded based on whether or not it was congruent with a relevant hypothesis formulated by the review team. COSMIN recommends that systematic review teams formulate a set of hypotheses for assessing known-groups and convergent validity (including direction and magnitude of correlations) [4]. This is to ensure that results from all studies included in the review are interpreted using the same criteria. In this review, the review team formulated hypotheses based on published papers and on their expert experience. Example hypotheses were ‘patients with worse symptoms would have lower PBM scores’ (for testing known-groups validity) and ‘PBM and Health Assessment Questionnaire (HAQ) scores would be negatively and strongly correlated’ (for testing convergent validity). If the results of a study supported the relevant hypothesis, a ‘positive’ rating was given; otherwise, a ‘negative’ rating was given. Reported results on test-retest reliability (i.e. ICC value) were graded using 0.70 as the threshold [4]. A ‘positive’ rating was given if the ICC value was ≥ 0.70; otherwise, a ‘negative’ rating was given. Although area under the curve (AUC) is recommended for assessing responsiveness by COSMIN, the review team used SES because all studies assessing responsiveness included in this review reported either only SES or results that could be used to calculate SES; only one study reported both AUC and SES. An SES value below 0.20 has been interpreted as negligible [3, 10]. The review team assigned studies reporting an SES value < 0.20 a ‘negative’ rating, and those with an SES value ≥ 0.20 a ‘positive’ rating. Using the ‘Risk of Bias’ assessment tool, the methodological quality of all studies was rated as ‘very good’, ‘adequate’, ‘doubtful’, or ‘inadequate’ [4]. 
Different standards were used to assess studies of convergent validity, known-groups validity, test-retest reliability, and responsiveness. These standards targeted various aspects of the design and execution of the studies. For example, measurement properties of the comparator instrument were targeted for assessing studies of convergent validity; characteristics of the comparison groups were targeted for assessing studies of known-groups validity; and stability of patients, time interval between test and retest, and similarity between test conditions were targeted for assessing studies of test-retest reliability. All assessments were made according to COSMIN recommendations, except for one of the standards for assessing convergent validity studies and the standards for assessing responsiveness studies (the modified standards used are shown in Appendices 2 and 3 of Supplementary file).
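
The grading rules for individual study results described above can be written down as a minimal Python sketch. The thresholds (ICC ≥ 0.70, SES ≥ 0.20) and the ‘positive’/‘negative’ labels come from the text; the function and constant names are hypothetical, and the text does not say how the sign of an SES value was handled, so the comparison below follows the stated inequality literally.

```python
ICC_THRESHOLD = 0.70  # test-retest reliability threshold stated in the text
SES_THRESHOLD = 0.20  # responsiveness threshold stated in the text


def rate_construct_validity(hypothesis_supported):
    """'positive' if the study result supports the review team's hypothesis."""
    return "positive" if hypothesis_supported else "negative"


def rate_test_retest(icc):
    """'positive' if the reported ICC is at least 0.70."""
    return "positive" if icc >= ICC_THRESHOLD else "negative"


def rate_responsiveness(ses):
    """'positive' if the reported (or calculated) SES is at least 0.20."""
    return "positive" if ses >= SES_THRESHOLD else "negative"


# Hypothetical study results, for illustration only.
print(rate_construct_validity(True))   # hypothesis supported
print(rate_test_retest(0.65))          # ICC below the threshold
print(rate_responsiveness(0.35))       # SES above the threshold
```

The methodological quality rating (‘very good’ to ‘inadequate’) is a separate judgement made with the ‘Risk of Bias’ tool and is not modelled here.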

Assessment of the Preference-Based Measures (PBMs)

Since measurement properties may vary across populations, the review team assessed the measurement properties of each PBM in different populations separately. In this review, EQ-5D-3L and EQ-5D-5L were treated as one PBM (i.e. EQ-5D), Health Utilities Index (HUI) 2 and HUI3 as HUI, and SF-6Ds derived from SF-12, SF-36, and its descriptive system were not examined separately. For each PBM, different language versions or modes of administration (i.e. self- and interviewer-administered) were not examined separately. The populations were defined first by country/district and then by disease group. The disease groups were defined by the primary medical conditions of study samples included in this review using the International Classification of Diseases, 11th Revision (ICD-11) [11]. Studies on the general population were treated as one group. For each PBM, separate assessments were performed using relevant studies to evaluate its population-specific measurement properties. Each of the assessments had two components: the measurement property and the quality of the evidence used in the assessment. The measurement property was rated as ‘sufficient’ (if at least 75% of the relevant studies had a ‘positive’ rating), ‘inconsistent’ (if 25–74% of the relevant studies had a ‘positive’ rating), or ‘insufficient’ (if < 25% of the relevant studies had a ‘positive’ rating) [4]. Using the COSMIN Grading of Recommendation Assessment, Development, and Evaluation (GRADE), the quality of evidence was rated as ‘high’, ‘moderate’, ‘low’, or ‘very low’. To determine the grade for quality of evidence, the review team first assigned a rating of ‘high’ and then downgraded the rating based on the methodological quality of included studies (i.e. the ‘Risk of Bias’ factor) and the sample sizes of the studies (i.e. the ‘Imprecision’ factor). The review team did not apply the ‘Inconsistency’ and ‘Indirectness’ downgrading factors, as recommended by COSMIN [4]. 
In this review, inconsistency in the characteristics of the study samples was resolved by summarizing the results separately for different populations, and inconsistency in results was used to grade the quality of the PBMs. ‘Indirectness’ was not used as a downgrading factor because only studies of the populations of interest to the review team (i.e. populations from East and South-East Asia) were included (the modified GRADE criteria can be found in Appendices 4 and 5 of Supplementary file).
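
The two components of each assessment can be sketched in Python as follows. The percentage cut-offs and the four quality levels are taken from the text; the function names are hypothetical, and the number of levels to downgrade is treated as an input because the modified criteria for deciding it are given only in the supplementary appendices.

```python
GRADE_LEVELS = ["high", "moderate", "low", "very low"]


def rate_measurement_property(study_ratings):
    """Pool 'positive'/'negative' study ratings into one rating for a PBM."""
    if not study_ratings:
        raise ValueError("at least one study rating is required")
    positive_share = 100 * study_ratings.count("positive") / len(study_ratings)
    if positive_share >= 75:
        return "sufficient"
    if positive_share < 25:
        return "insufficient"
    return "inconsistent"


def grade_quality_of_evidence(rob_levels=0, imprecision_levels=0):
    """Start at 'high' and downgrade; the result cannot fall below 'very low'."""
    index = min(rob_levels + imprecision_levels, len(GRADE_LEVELS) - 1)
    return GRADE_LEVELS[index]


# Hypothetical example: 7 of 10 studies rated 'positive', evidence downgraded
# by 2 levels for risk of bias and 1 level for imprecision.
ratings = ["positive"] * 7 + ["negative"] * 3
print(rate_measurement_property(ratings))
print(grade_quality_of_evidence(rob_levels=2, imprecision_levels=1))
```

With these inputs the sketch yields ‘inconsistent’ and ‘very low’, which corresponds to an entry of the form ± V^b,c in the notation of the results tables.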

Results

The search initially identified a total of 1710 papers from four databases, which was reduced to 735 upon removal of duplicates, and further reduced to 114 after assessment of titles and abstracts. After assessment of full-text, 79 papers were retained for this systematic review [12-90]. A Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram for the selection process is shown in Fig. 1.
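
The selection flow reported above can be retraced with a few lines of Python. Only the four stage totals come from the text; the numbers removed at each step are not stated there and are derived here by subtraction, and the stage labels are paraphrased.

```python
# Stage totals reported in the text (labels paraphrased).
stages = [
    ("identified from four databases", 1710),
    ("after removal of duplicates", 735),
    ("after title and abstract assessment", 114),
    ("after full-text assessment", 79),
]

# Papers removed at each step, derived by subtracting consecutive totals.
removed = {
    later_label: earlier_count - later_count
    for (_, earlier_count), (later_label, later_count) in zip(stages, stages[1:])
}

for label, count in removed.items():
    print(f"{label}: {count} papers removed")
```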

Fig. 1

Chart for search results and selection of papers, PROMs patient-reported outcome measures

A total of 1504 individual studies were identified from the 79 retained papers. Table 1 shows the numbers of included papers and studies, organized by measurement property, PBM, and population. EQ-5D was the most studied PBM, construct validity was the most studied measurement property, Singapore and China produced the largest number of papers, and the general population was the most studied population. No relevant studies were found for Assessment of Quality of Life (AQOL), 15-Dimensional (15D) or the Philippines. A more detailed breakdown of the distribution of the papers can be found in Appendices 6 and 7 of Supplementary file.

Table 1

Included papers and studies, by category

	No. of papers/studies
Measurement property
Construct validity	73/1363
Test-retest reliability	25/61
Responsiveness	16/80
PBM
EQ-5D-3L	46/498
EQ-5D-5L	28/311
EQ-VAS	37/405
SF-6D	20/197
HUI2	2/16
HUI3	6/55
QWB	2/22
Country/district
China	19/376
Hong Kong	10/177
Japan	5/38
Malaysia	4/21
Singapore	19/374
South Korea	7/159
Taiwan	6/184
Thailand	6/146
Vietnam	1/12
Indonesia	2/17
Disease groups
Cancer	10/225
Developmental disease	1/14
Diabetes	5/56
Eye disease	3/32
Gastric disease	1/6
General population	17/302
Genitourinary disease	1/24
Heart disease	2/47
Hepatitis	2/31
HIV	3/39
Injury	1/60
Kidney disease	2/15
Mental disorders	3/65
Multiple conditions	3/130
Musculoskeletal disease	6/113
Neurological disease	3/78
Respiratory disease	3/32
Rheumatic disease	9/150
Skin disease	1/2
Stroke	3/71
Thyroid disease	1/12

PBM preference-based measures, EQ-5D-3L EuroQol-5 Dimensions, 3-Level Version, EQ-5D-5L EuroQol-5 Dimensions, 5-Level Version, EQ-VAS EuroQol-Visual Analog Scale, SF-6D Short Form-6 Dimensions, HUI Health Utilities Index, QWB Quality of Well-Being
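
As a consistency check on Table 1, the study counts in each grouping can be summed and compared with the 1504 studies reported in the text. The sketch below transcribes the study counts (the second number in each cell) from the table; the variable names are hypothetical. Paper counts are left out because a paper can contain more than one study and may therefore appear under several categories.

```python
# Study counts transcribed from Table 1 (second number in each cell).
studies = {
    "measurement property": {
        "Construct validity": 1363, "Test-retest reliability": 61,
        "Responsiveness": 80,
    },
    "PBM": {
        "EQ-5D-3L": 498, "EQ-5D-5L": 311, "EQ-VAS": 405, "SF-6D": 197,
        "HUI2": 16, "HUI3": 55, "QWB": 22,
    },
    "country/district": {
        "China": 376, "Hong Kong": 177, "Japan": 38, "Malaysia": 21,
        "Singapore": 374, "South Korea": 159, "Taiwan": 184,
        "Thailand": 146, "Vietnam": 12, "Indonesia": 17,
    },
    "disease group": {
        "Cancer": 225, "Developmental disease": 14, "Diabetes": 56,
        "Eye disease": 32, "Gastric disease": 6, "General population": 302,
        "Genitourinary disease": 24, "Heart disease": 47, "Hepatitis": 31,
        "HIV": 39, "Injury": 60, "Kidney disease": 15, "Mental disorders": 65,
        "Multiple conditions": 130, "Musculoskeletal disease": 113,
        "Neurological disease": 78, "Respiratory disease": 32,
        "Rheumatic disease": 150, "Skin disease": 2, "Stroke": 71,
        "Thyroid disease": 12,
    },
}

totals = {grouping: sum(counts.values()) for grouping, counts in studies.items()}
eq5d_total = studies["PBM"]["EQ-5D-3L"] + studies["PBM"]["EQ-5D-5L"]

for grouping, total in totals.items():
    print(f"{grouping}: {total} studies")
print(f"EQ-5D (3L and 5L combined): {eq5d_total} studies")
```

Each grouping adds up to 1504, so every study is counted exactly once per grouping, and the combined EQ-5D count of 809 matches the 729, 38, and 42 studies reported for its three measurement properties.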

Results were ‘positive’ in 80% of construct validity studies, 79% of test-retest reliability studies, and 57% of responsiveness studies. While 99% of the construct validity studies and 61% of the responsiveness studies were rated as having ‘very good’ or ‘adequate’ methodological quality, only a small portion of test-retest reliability studies (23%) achieved ‘very good’ or ‘adequate’ methodological quality. A total of 729, 38, and 42 studies assessing construct validity, test-retest reliability, and responsiveness of EQ-5D, respectively, were identified. EQ-5D-3L was more commonly studied than EQ-5D-5L. For example, more than twice as many construct validity studies were reported for EQ-5D-3L as for EQ-5D-5L. The results for EQ-5D are summarized in Table 2. ‘Sufficient’ construct validity was found in 6 of 10 countries/districts and 17 of 20 disease groups assessed; ‘sufficient’ test-retest reliability in none of 8 countries/districts and 3 of 10 disease groups assessed; and ‘sufficient’ responsiveness in 5 of 6 countries/districts and 8 of 11 disease groups assessed.

Table 2

Grading results for EQ-5D in different countries/districts and different disease groups

	Quality of PBM, quality of evidence, and references
	Construct validity			Test-retest reliability			Responsiveness
China	+	H	[21, 27, 28, 41–43, 45, 57, 72, 75, 76, 82, 83, 85, 87, 89, 90]	±	H	[27, 41, 75]
Hong Kong	+	H	[17, 18, 20, 70, 79]	±	L^b	[20, 70, 79]	+	V^b,c	[19]
Japan	±	M^a	[56, 66, 67]				+	H	[56]
Malaysia	+	H	[53, 65, 71]	−	V^b,c	[53]
Singapore	+	H	[12–14, 38, 40, 47, 50, 68, 74, 77, 84, 86]	±	L^b	[12, 37, 38, 47]	±	H	[12–14, 38, 51, 68]
South Korea	+	H	[29–32, 34, 39]	±	L^b	[29–32, 34]	+	H	[29, 33]
Taiwan	±	H	[15, 16, 26, 36, 44, 88]	±	M^a	[15, 36]	+	M^a	[16, 26, 44]
Thailand	+	H	[35, 58, 60–63]	±	L^b	[58, 63]	+	M^a	[60, 63]
Vietnam	±	H	[69]
Indonesia	±	H	[59]	−	L^b	[59]
Cancer	+	H	[30, 32, 34, 36, 38, 42, 67]	±	L^b	[30, 32, 34, 36–38]	±	M^c	[38]
Diabetes	+	H	[39, 57, 58, 74]	±	L^b	[58]
Eye disease	+	H	[13, 14]				±	H	[13, 24]
Gastric disease	+	H	[53]	−	V^b,c	[53]
General population	+	H	[15, 28, 31, 35, 41, 45, 59, 65, 66, 71, 75, 83, 85, 88, 89]	±	M^a	[15, 31, 41, 59, 75]
Genitourinary disease	±	H	[90]
Heart disease	+	H	[60, 82]				+	M^c	[60]
Hepatitis	+	L^b	[27]	+	H	[27]
HIV	+	H	[62, 69, 72]
Injury	+	H	[26]				+	L^b	[26]
Kidney disease	+	H	[86]
Mental disorders	±	H	[12, 68]	±	L^b	[12]	±	M^a	[12, 68]
Multiple conditions	+	H	[61, 63, 77]	+	L^b	[63]	+	L^b	[63]
Musculoskeletal disease	+	H	[18, 20, 21, 87]	+	L^b	[20]	+	V^b,c	[19]
Neurological disease	+	H	[50]				+	L^d	[51]
Respiratory disease	+	M^a	[17, 56]				+	H	[56]
Rheumatic disease	+	H	[29, 40, 47, 70, 76, 84]	±	L^b	[29, 47, 70]	+	M^c	[29]
Skin disease	+	H	[43]
Stroke	±	H	[16, 44]				+	H	[16, 33, 44]
Thyroid disease	+	H	[79]	−	L^b	[79]

Quality of PBM: + indicates sufficient results; ± indicates inconsistent results; − indicates insufficient results

Italicised font indicates that grading is based on no more than three studies

Quality of evidence: H indicates high; M indicates moderate; L indicates low; V indicates very low

EQ-5D EuroQol-5 Dimensions, PBM preference-based measure, ROB risk of bias

^a Quality downgraded by 1 level due to ROB

^b Quality downgraded by 2 levels due to ROB

^c Quality downgraded by 1 level due to imprecision

^d Quality downgraded by 2 levels due to imprecision

A total of 374, 15, and 16 studies assessing construct validity, test-retest reliability, and responsiveness of EQ-VAS, respectively, were identified. The results for EQ-VAS are summarized in Table 3. ‘Sufficient’ construct validity was found in 5 of 10 countries/districts and 8 of 14 disease groups assessed; ‘sufficient’ test-retest reliability in 4 of 6 countries/districts and 3 of 5 disease groups assessed; and ‘sufficient’ responsiveness in all 4 countries/districts and 6 of 7 disease groups assessed.

Table 3

Measurement properties of EQ-VAS in different countries/districts and disease groups

	Quality of PBM and evidence
	Construct validity			Test-retest reliability			Responsiveness
China	+	H	[21, 41, 72, 73, 75, 85]	+	M^a	[21, 41, 73, 75]
Hong Kong	+	H	[17, 18, 70]
Japan	±	M^c	[25]
Malaysia	±	H	[22, 53, 65]
Singapore	±	H	[13, 38, 46, 48–50, 77]	+	L^b	[38]	+	H	[13, 38, 51]
South Korea	±	H	[29–32]	−	L^b	[29, 30, 32]	+	M^c	[29]
Taiwan	±	H	[15, 16, 26, 36, 88]	+	H	[15]	+	L^b	[16, 26]
Thailand	+	H	[35, 61–63]	+	L^b	[63]	+	L^b	[63]
Vietnam	+	H	[69]
Indonesia	+	H	[59, 64]	±	M^a	[59, 64]
Cancer	±	H	[30, 32, 36, 38, 64]	±	L^b	[30, 32, 38, 64]	+	M^c	[38]
Diabetes	±	H	[46]
Eye disease	±	H	[13]				−	H	[13]
Gastric disease	+	H	[53]
General population	+	H	[15, 25, 31, 35, 41, 59, 65, 75, 85, 88]	+	M^a	[15, 41, 59, 75]
HIV	+	H	[62, 69, 72]
Injury	±	H	[26]				+	L^b	[26]
Kidney disease	+	H	[22]
Multiple conditions	+	H	[61, 63, 77]	+	L^b	[63]	+	L^b	[63]
Musculoskeletal disease	+	H	[18, 21]	−	M^c	[21]
Neurological disease	±	H	[50]				+	L^d	[51]
Respiratory disease	+	H	[17]
Rheumatic disease	+	H	[29, 48, 49, 70, 73]	+	L^b	[29, 73]	+	M^c	[29]
Stroke	−	H	[16]				+	M^c	[16]

Quality of PBM: + indicates sufficient results; ± indicates inconsistent results; − indicates insufficient results

Quality of evidence: H indicates high; M indicates moderate; L indicates low

Italicised font indicates that grading is based on no more than three studies

EQ-VAS EuroQol-Visual Analog Scale, PBM preference-based measure, ROB risk of bias

^a Quality downgraded by 1 level due to ROB

^b Quality downgraded by 2 levels due to ROB

^c Quality downgraded by 1 level due to imprecision

^d Quality downgraded by 2 levels due to imprecision

A total of 179, 3, and 15 studies assessing construct validity, test-retest reliability, and responsiveness of SF-6D, respectively, were identified. The results for SF-6D are summarized in Table 4. ‘Sufficient’ construct validity was found in 2 of 5 countries/districts and 6 of 11 disease groups assessed; ‘sufficient’ test-retest reliability in 1 (Hong Kong) of 2 countries/districts and 1 (thyroid) of 2 disease groups assessed; and ‘sufficient’ responsiveness in only 1 (South Korea) of 3 countries/districts and only 2 of 4 disease groups assessed.

Table 4

Measurement properties of SF-6D in different countries/districts and different disease groups

	Quality of PBM and evidence
	Construct validity			Test-retest reliability			Responsiveness
China	±	H	[28, 42, 82, 87, 89, 90]
Hong Kong	+	H	[17, 18, 78, 80]	+	L^a	[79]	±	L^a	[81]
Japan	–	H	[66]
Singapore	+	H	[12, 40, 84, 86]	±	V^a,b	[12]	±	L^a	[12, 24]
South Korea							+	H	[33]
Thailand	±	H	[61]
Cancer	+	H	[73]				±	L^a	[81]
Eye disease							+	L^a	[24]
General population	±	H	[28, 66, 80, 89]
Genitourinary disease	±	H	[90]
Heart disease	±	H	[82]
Hepatitis	±	H	[78]
Kidney disease	+	H	[86]
Mental disorders	+	H	[12]	±	V^a,b	[12]	±	L^a	[12]
Multiple conditions	±	H	[61]
Musculoskeletal disease	+	H	[18, 87]
Respiratory disease	+	H	[17]
Rheumatic disease	+	H	[40, 84]
Stroke							+	H	[33]
Thyroid disease				+	L^a	[79]

Quality of PBM: + indicates sufficient results; ± indicates inconsistent results; – indicates insufficient results

Italicised font indicates that grading is based on no more than three studies

SF-6D Short Form-6 Dimensions, PBM preference-based measure, ROB risk of bias

Quality of evidence: H indicates high; L indicates low; V indicates very low

^a Quality downgraded by 2 levels due to ROB

^b Quality downgraded by 1 level due to imprecision

A total of 59, 5, and 7 studies assessing construct validity, test-retest reliability, and responsiveness of HUI, respectively, were identified. The results for HUI are summarized in Table 5. ‘Sufficient’ construct validity was found in all 3 countries/districts and all 4 disease groups assessed; ‘sufficient’ test-retest reliability in 1 (Thailand) of 2 countries/districts and 2 of 3 disease groups assessed; and ‘sufficient’ responsiveness in 1 (Thailand) of 2 countries/districts and 2 of 3 disease groups assessed.

Table 5

Measurement properties of HUI in different countries/districts and different disease groups

	Quality of PBM and evidence
	Construct validity			Test-retest reliability			Responsiveness
Hong Kong	+	H	[54]
Singapore	+	H	[12, 47, 52]	±	L^b	[12, 47]	±	M^a	[12, 24]
Thailand	+	H	[60]	+	H	[60]	+	H	[60]
Developmental disease	+	H	[54]
Eye disease							+	H	[24]
Heart disease	+	H	[60]	+	H	[60]	+	H	[60]
Mental disorders	+	H	[12, 52]	±	V^b,c	[12]	±	L^b	[12]
Rheumatic disease	+	H	[47]	+	V^b,c	[47]

Quality of PBM: + indicates sufficient results; ± indicates inconsistent results; − indicates insufficient results

Quality of evidence: H indicates high; M indicates moderate; L indicates low; V indicates very low

Italicised font indicates that grading is based on no more than three studies

HUI Health Utilities Index, PBM preference-based measure, ROB risk of bias

^a Quality downgraded by 1 level due to ROB

^b Quality downgraded by 2 levels due to ROB

^c Quality downgraded by 1 level due to imprecision

A total of 22 studies assessing the construct validity of the Quality of Well-Being (QWB) scale were identified. ‘Sufficient’ construct validity was found in both China and Japan, and in both the neurological and respiratory disease groups.

Discussion

This systematic review targets the measurement properties of generic PBMs in East and South-East Asian countries. To the best of the review team’s knowledge, this is the first systematic review of its kind. This review found that the generic PBMs that have been tested are EQ-5D, SF-6D, HUI (i.e. HUI2 and HUI3) and QWB, and that EQ-5D (i.e. EQ-5D-3L and EQ-5D-5L) might be the preferred choice when a generic PBM is needed in Asia. First, EQ-5D has the largest body of evidence for all measurement properties and populations assessed. Second, it exhibited ‘sufficient’ construct validity and responsiveness in the largest number of populations, and ‘insufficient’ construct validity or responsiveness in none of the populations assessed. Satisfactory construct validity and responsiveness were also reported in past systematic reviews of EQ-5D in musculoskeletal [91], schizophrenia [92], skin [93], metabolic [94, 95], and respiratory diseases [96]. However, the current finding that EQ-5D is valid and responsive for patients with eye and heart diseases is at odds with the finding from a systematic review [95] that was mainly based on evidence from European populations. The contradictory findings from the two systematic reviews suggest that the measurement properties of PBMs might vary from region to region. Therefore, it might be worthwhile to perform similar reviews for other regions to better inform the selection of PBMs for use in different populations. The test-retest reliability of EQ-5D was found to be either ‘inconsistent’ or ‘insufficient’ for almost all populations, which is largely inconsistent with past systematic reviews [91, 94, 96]. The inferior test-retest reliability of EQ-5D revealed in this review could be related to the suboptimal quality of evidence, which was attributable to imperfect study design. 
In many studies included in this review, the ‘test’ was conducted when subjects visited a health institution, by face-to-face interview or self-completion, while the ‘retest’ was conducted over the telephone or via post when subjects were resting at home. The change in the data collection mode and setting from test to retest could have negatively affected the assessment result. Moreover, the test-retest reliability of EQ-5D could have been underestimated because of the long test-retest interval used in those studies. Most studies included in this systematic review conducted the retest 1–2 weeks after the first test, as recommended [97]. While an interval of 1–2 weeks is appropriate for testing scales using a recall period of 1–4 weeks, it may be too long for EQ-5D because its recall period is only one day (‘today’). It is very possible that the health status of patients experiencing episodic symptoms on a particular day would change after 1 or 2 weeks, thus violating the assumption of unchanged health status needed for test-retest reliability testing, and leading to a worse test result.

The results for EQ-VAS are not entirely surprising because a visual analogue scale is not as easy to understand or use as verbal or categorical rating scales, where each response option is attached to an explanatory label [98]. It is possible that Asians, on average, have more difficulty with the EQ-VAS than Westerners because of their relatively lower education levels [99]. The suboptimal construct validity could also be caused by the vagueness of the labels used by EQ-VAS. In a qualitative study of Asians from Singapore [100], great variations in the interpretation of ‘best imaginable health’ were observed, which casts doubt on the comparability of EQ-VAS scores across individuals. However, a ‘sufficient’ result on responsiveness suggests that the EQ-VAS can be useful in evaluating individual-level change in HRQoL. 
The suboptimal construct validity results for SF-6D are somewhat surprising. The descriptive system of SF-6D is more comprehensive than that of EQ-5D, and worldwide studies comparing SF-6D and EQ-5D found the two PBMs to have comparable measurement properties. One possible explanation is the relatively lower literacy rate among elderly patients in Asia. According to UNESCO data [101], the elderly in European countries, such as Italy and Romania, have a literacy rate of > 85%. On the other hand, the literacy rate for the elderly in Asian countries, such as Thailand and Malaysia, is below 40%. SF-6D data are usually collected through the SF-36, which contains 36 questions with relatively long sentence structures that might be difficult for some respondents with a lower literacy level [99].

This study provides some directions for future research on generic PBMs in Asia. First, future research should be expanded to rarely or never tested PBMs such as HUI, QWB, and AQOL. HUI (i.e. HUI2 and HUI3) especially merits further research because ‘sufficient’ support has been shown for most measurement properties in all populations assessed. Second, researchers are strongly recommended to use a better design in future studies of test-retest reliability and responsiveness, such as using the same data collection mode at all time points. Last, studies should be conducted to ascertain the reasons for the suboptimal construct validity of the SF-6D and EQ-VAS, and to explore ways to improve their performance in Asian populations.

This study has three limitations. First, since some of the COSMIN methods and tools do not apply to a systematic review of multiple measures in multiple populations, it was necessary for the review team to modify the original methods. Due to these modifications, it may not be meaningful to compare the results from this review with those from other reviews that applied the original COSMIN methods. These modifications, however, are unlikely to favour any of the PBMs included in this study. The second limitation is the exclusion of papers published in non-English journals due to limited manpower and resources. There are databases in the Chinese, Japanese, and Korean languages that could include validation studies of PBMs. Therefore, the results of this review might not truly reflect the performance of the generic PBMs in China, Japan, and South Korea. Third, since different language versions were not differentiated, results from this review for Singapore and Malaysia might not be accurate for all language versions of the studied instruments. Despite the effort that has been put into translation, psychometric equivalence between source and target languages might not necessarily occur [102]. Nevertheless, studies have shown measurement equivalence between different language versions of EQ-5D and SF-6D in Singapore [103-106].

Conclusions

This systematic review provides a summary of the quality of existing generic PBMs in Asian populations from different countries and different disease groups. The current evidence supports the use of EQ-5D as the preferred choice when a generic PBM is needed, and the continuous testing of all PBMs in the region.

Electronic supplementary material: Electronic supplementary material 1 (DOCX 40 kb)

Generic preference-based measures (PBMs) play an important role in health technology assessment in Asian countries.

The EuroQol-5 Dimensions (EQ-5D) has shown good construct validity and responsiveness in most countries and most disease groups in East and South-East Asia.

Future research should be expanded to rarely or never tested PBMs, such as the Health Utilities Index, Quality of Well-Being scale, and Assessment of Quality of Life instrument in this region.

References (96 in total; first 10 listed)

1. Validity and reliability of the EQ-5D self-report questionnaire in English-speaking Asian patients with rheumatic diseases in Singapore.
Authors: N Luo; L H Chew; K Y Fong; D R Koh; S C Ng; K H Yoon; S Vasoo; S C Li; J Thumboo
Journal: Qual Life Res Date: 2003-02

2. Application and measurement properties of EQ-5D to measure quality of life in patients with upper extremity orthopaedic disorders: a systematic literature review.
Authors: Cécile Grobet; Miriam Marks; Linda Tecklenburg; Laurent Audigé
Journal: Arch Orthop Trauma Surg Date: 2018-04-13

3. A comparison of the reliability and validity of SF-6D, EQ-5D and HUI3 utility measures in patients with schizophrenia and patients with depression in Singapore.
Authors: Edimansyah Abdin; Siow Ann Chong; Esmond Seow; Chao Xu Peh; Jit Hui Tan; Jianlin Liu; Sophia Foo Si Hui; Boon Yiang Chua; Kang Sim; Swapna Verma; Janhavi Ajit Vaingankar; Mythily Subramaniam
Journal: Psychiatry Res Date: 2019-03-01

4. Health state utilities and subjective well-being among psoriasis vulgaris patients in mainland China.
Authors: Liu Liu; Shunping Li; Yue Zhao; Jianglin Zhang; Gang Chen
Journal: Qual Life Res Date: 2018-02-28

5. Taking stock of cost-effectiveness analysis of healthcare in China.
Authors: Thomas Butt; Gordon G Liu; David D Kim; Peter J Neumann
Journal: BMJ Glob Health Date: 2019-05-14

6. EQ-5D in skin conditions: an assessment of validity and responsiveness.
Authors: Yaling Yang; John Brazier; Louise Longworth
Journal: Eur J Health Econ Date: 2014-10-31

7. Reliability and validity of the EQ-5D-3L for Kashin-Beck disease in China.
Authors: Hua Fang; Umer Farooq; Dimiao Wang; Fangfang Yu; Mohammad Imran Younus; Xiong Guo
Journal: Springerplus Date: 2016-11-07

8. EQ-5D-5L norms for the urban Chinese population in China.
Authors: Zhihao Yang; Jan Busschbach; Gordon Liu; Nan Luo
Journal: Health Qual Life Outcomes Date: 2018-11-08

9. Validity and reliability of the EQ-5D-5L in family caregivers of leukemia patients.
Authors: Limin Li; Chaojie Liu; Xiuzhi Cai; Hongjuan Yu; Xueyun Zeng; Mingjie Sui; Erwei Zheng; Yang Li; Jiao Xu; Jin Zhou; Weidong Huang
Journal: BMC Cancer Date: 2019-05-30

10. Health-related quality of life in pregnant women living with HIV: a comparison of EQ-5D and SF-12.
Authors: Xiaowen Wang; Guangping Guo; Ling Zhou; Jiarui Zheng; Xiumin Liang; Zhanqin Li; Hongzhuan Luo; Yuyan Yang; Liyuan Yang; Ting Tan; Jun Yu; Lin Lu
Journal: Health Qual Life Outcomes Date: 2017-08-30

