
Interrater and Intrarater Reliability of the Beighton Score: A Systematic Review.

Lauren N Bockhorn¹, Angelina M Vera¹, David Dong¹, Domenica A Delgado¹, Kevin E Varner¹, Joshua D Harris¹.

Abstract

BACKGROUND: The Beighton score is commonly used to assess the degree of hypermobility in patients with hypermobility spectrum disorder. Since proper diagnosis and treatment in this challenging patient population require valid, reliable, and responsive clinical assessments such as the Beighton score, studies must properly evaluate efficacy and effectiveness.
PURPOSE: To succinctly present a systematic review to determine the inter- and intrarater reliability of the Beighton score and the methodological quality of all analyzed studies for use in clinical applications.
STUDY DESIGN: Systematic review; Level of evidence, 3.
METHODS: A systematic review of the MEDLINE, Embase, CINAHL, and SPORTDiscus databases was performed. Studies that measured inter- or intrarater reliability of the Beighton score in humans with and without hypermobility were included. Non-English, animal, cadaveric, level 5 evidence, and studies utilizing the Beighton score self-assessment version were excluded. Data were extracted to compare scoring methods, population characteristics, and measurements of inter- and intrarater reliability. Risk of bias was assessed with the COSMIN (Consensus-Based Standards for the Selection of Health Measurement Instruments) 2017 checklist.
RESULTS: Twenty-four studies were analyzed (1333 patients; mean ± SD age, 28.19 ± 17.34 years [range, 4-71 years]; 640 females, 594 males, 273 unknown sex). Of the 24 studies, 18 reported raters were health care professionals or health care professional students. For interrater reliability, 5 of 8 (62.5%) intraclass correlation coefficients and 12 of 19 (63.2%) kappa values were substantial to almost perfect. Intrarater reliability was reported as excellent in all studies utilizing intraclass correlation coefficients, and 3 of the 7 articles using kappa values reported almost perfect values. Utilizing the COSMIN criteria, we determined that 1 study met "very good" criteria, 7 met "adequate," 15 met "doubtful," and 1 met "inadequate" for overall risk of bias in the reliability domain.
CONCLUSION: The Beighton score is a highly reliable clinical tool that shows substantial to excellent inter- and intrarater reliability when used by raters of variable backgrounds and experience levels. While individual components of risk of bias among studies demonstrated large discrepancy, most of the items were adequate to very good.


Keywords: Beighton score; hypermobility; interrater; intrarater; systematic review

Year: 2021 PMID: 33786328 PMCID: PMC7960900 DOI: 10.1177/2325967120968099

Source DB: PubMed Journal: Orthop J Sports Med ISSN: 2325-9671

The Beighton score is the cornerstone for diagnosing hypermobility syndromes, including hypermobility spectrum disorder or hypermobile Ehlers-Danlos syndrome.[13,59] The original criteria do not provide a detailed description,[6] which leaves them open for interpretation and uncertainty of application. No threshold score is determined by the original description,[6] nor is there consensus throughout the literature on what defines hypermobility.[24,34] However, variations are seen in hypermobility depending on age, sex, and race; thus, some experts believe that thresholds should be individualized to subpopulations.[51,52] Given the imprecision of the Beighton score, studies utilizing it may be inconsistent in starting positions, performance, and benchmarks.[34] Questions left unanswered by the Beighton score include whether the tests should be performed actively by the respondent or passively by the clinician and whether a warm-up period is required.[35] The risk of these inherent shortcomings is that a lack of specificity could affect the score’s generalizable applicability and reliability. In addition, the Beighton score does not account for symptoms. Laxity is defined as excessive motion in a specific joint in an asymptomatic individual. “Excessive,” relative to a joint, is defined as abnormally increased or supraphysiologic motion, also known as “hypermobility.” “Instability” is defined as excessive motion in a specific joint in a symptomatic individual. The key distinction between laxity and instability is the absence (former) or presence (latter) of symptoms. Historically, studies have consistently reported excellent reliability of the Beighton score.
However, recent systematic reviews have reported these studies to show conflicting evidence, and they have cited concerns with the methodology based on requirements with COSMIN (Consensus-Based Standards for the Selection of Health Measurement Instruments) criteria that are clinically inapplicable to this score.[17,36] The training and experience of raters[26,42] and the time between examinations[33] have the potential to affect the measures of Beighton score reliability according to the current COSMIN criteria. Reliable, accurate, and precise measures for hypermobility are necessary for operative and nonoperative musculoskeletal care for clinicians and surgeons. Specifically, they can guide treatment choices in patellofemoral,[10] shoulder,[53] and hip instability[46] as well as anterior cruciate ligament (ACL) reconstruction.[41,55] Owing to the significant heterogeneity in evidence regarding the Beighton score, the purpose of this investigation was to succinctly present a systematic review to determine the inter- and intrarater reliability of the Beighton score and the methodological quality of all analyzed studies in the context of clinical applicability. We hypothesized that this systematic review would demonstrate excellent inter- and intrarater reliability and substantial methodological quality that is satisfactory for surgeons’ clinical use.

Methods

The review protocol was registered via the National Institute for Health Research’s PROSPERO International Prospective Register of Systematic Reviews (CRD42018081703).[28] The systematic review was conducted according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines.[43] Utilizing PICO (population, intervention, comparison, outcome) to fit a measurement tool, we examined research addressing humans of any age, degree of hypermobility, the Beighton score, and inter- and intrarater reliability. Therefore, it was determined that studies evaluating the clinical Beighton score between and among raters as a primary or secondary outcome would be included and all others would be considered the wrong outcome. Studies that utilized the Beighton self-assessment exclusively, in which patients independently measured and reported their own score, were excluded. Reviews, abstracts, theses, unpublished studies, articles not available in English, and studies with animal or cadaveric subjects were also excluded. A systematic computerized search (Appendix 1) was conducted by 1 author (L.N.B.) on January 30, 2018, in 4 databases (MEDLINE, Embase, CINAHL, and SPORTDiscus) with no limitations on dates of inclusion. To reduce the search bias, the search strategy was conducted using Medical Subject Headings. A search in ClinicalTrials.gov was also conducted to identify any possible ongoing studies. The search terms included, but were not limited to the following: Beighton, joint laxity, hypermobility, reproducibility of results, observer variation, reliability, interrater, or intrarater (Appendix 1). Identified records were imported to the systematic review software Rayyan (Qatar Computing Research Institute),[48] and duplicates were removed. Articles were screened in a 2-step process, first by title and abstract according to exclusion criteria. 
Second, articles included by abstract were imported into Rayyan; full texts were made available; and 2 authors (L.N.B. and A.M.V.) independently screened by reading the article abstract and the article full text for inclusion according to both eligibility criteria. Disagreements concerning final inclusion were settled by consensus between these authors during a deliberation session. The data extraction sheet was developed according to the Cochrane Consumers and Communication Review Group’s data extraction template,[30] was pilot tested on 3 randomly selected included studies, and then refined accordingly. One review author (L.N.B.) extracted the data from included studies, which the second author (A.M.V.) verified. Disagreements were resolved by discussion between them; if no agreement could be reached, it was planned that a third author (J.D.H.) would decide. No authors were contacted for additional information, and all missing data were labeled “not specified.” The included articles were independently assessed by 2 authors (L.N.B. and A.M.V.) for risk of bias using the COSMIN checklist.[44] The complete COSMIN checklist includes 12 boxes, covering internal consistency, reliability, measurement error, validity, and responsiveness. This review exclusively evaluated reliability (COSMIN box 6), which was determined to be crucial to the context in which inter- and intraobserver values were interpreted. The overall methodological quality of a study is determined by the lowest rating among the items in the reliability box (ie, “the worst score counts” principle), including “very good,” “adequate,” “doubtful,” and “inadequate.” Individual scores on the COSMIN “reliability” subitems were assessed and are included in Appendix 2 for completeness. COSMIN question 6.8, “other methodological flaws,” was not assessed because of the subjectivity of the question. 
To minimize selection bias, studies were not excluded on the basis of methodological quality, as they were evaluated only in the reliability domain and the lowest score determined the overall quality in reliability.
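
The "worst score counts" principle described above can be sketched in a few lines of Python. This is a minimal illustration, not the COSMIN authors' tooling: the function name `overall_rating`, the `RATING_ORDER` list, and the choice to ignore items marked "NA" are assumptions made for the example, and the item ratings shown are illustrative.

```python
# Ratings ordered from best to worst. Items rated "NA" (not applicable)
# are ignored here; that handling is an assumption for this sketch.
RATING_ORDER = ["very good", "adequate", "doubtful", "inadequate"]


def overall_rating(item_ratings):
    """Return the worst rating among the applicable reliability items."""
    applicable = [r for r in item_ratings if r != "NA"]
    if not applicable:
        return "NA"
    # The rating with the highest index in RATING_ORDER is the worst one.
    return max(applicable, key=RATING_ORDER.index)


# Illustrative ratings for items 6.1 to 6.7 of a study that reported an
# unweighted kappa: one "doubtful" item determines the overall rating.
example_items = [
    "very good", "very good", "very good",  # 6.1-6.3 design requirements
    "NA",                                   # 6.4 ICC (not applicable)
    "very good",                            # 6.5 kappa calculated
    "doubtful",                             # 6.6 unweighted kappa
    "adequate",                             # 6.7 weighting scheme
]
print(overall_rating(example_items))
```

A single low item rating therefore caps the overall rating, regardless of how many other items were rated "very good."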

Box 6

Reliability

		Very Good	Adequate	Doubtful	Inadequate	Not applicable
Design requirements
1	Were patients stable in the interim period on the construct to be measured?	Evidence provided that patients were stable	Assumable that patients were stable	Unclear if patients were stable	Patients were NOT stable
2	Was the time interval appropriate?	Time interval appropriate		Doubtful whether time interval was appropriate or time interval was not stated	Time interval NOT appropriate
3	Were the test conditions similar for the measurements (eg type of administration, environment, instructions)?	Test conditions were similar (evidence provided)	Assumable that test conditions were similar	Unclear if test conditions were similar	Test conditions were NOT similar
Statistical methods
4	For continuous scores: Was an intraclass correlation coefficient (ICC) calculated?	ICC calculated and model or formula of the ICC is described	ICC calculated but model or formula of the ICC not described or not optimal. Pearson or Spearman correlation coefficient calculated with evidence provided that no systematic change has occurred	Pearson or Spearman correlation coefficient calculated WITHOUT evidence provided that no systematic change has occurred or WITH evidence that systematic change has occurred	No ICC or Pearson or Spearman correlations calculated	Not applicable
5	For dichotomous/nominal/ordinal scores: Was kappa calculated?	Kappa calculated			No kappa calculated	Not applicable
6	For ordinal scores: Was a weighted kappa calculated?	Weighted Kappa calculated		Unweighted Kappa calculated or not described		Not applicable
7	For ordinal scores: Was the weighting scheme described? eg linear, quadratic	Weighting scheme described	Weighting scheme NOT described			Not applicable
Other
8	Were there any other important flaws in the design or statistical methods of the study?	No other important methodological flaws		Other minor methodological flaws	Other important methodological flaws

From Mokkink LB, de Vet HCW, Prinsen CAC, et al. COSMIN risk of bias checklist for systematic reviews of patient-reported outcome measures. Qual Life Res. 2018;27(5):1171-1179.[44] Material distributed under the terms of the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).

We defined “reliability” as reproducibility of test values in repeated trials on the same individuals,[32] quantified by inter- and intrarater reliability. Consistency of outcomes recorded from 1 participant examined by the same observer multiple times was defined as intrarater reliability, while reproducibility of the score among observers was defined as interrater reliability.[4] Since the level of measurement of the Beighton score is not defined, researchers use different statistics to quantify these 2 values. Nominal and ordinal data were analyzed with the Cohen or weighted kappa (κ) coefficient,[50] which varies from –1 to 1. COSMIN criteria favor weighted kappa values, which penalize disagreements in terms of their seriousness, over unweighted kappa values, which treat disparities equally.[14,56] Less rigorous expressions of inter- and intrarater reliability include percentage agreement and the Spearman rho. While percentage agreement is a direct measurement of the similarity between chosen values, it does not take into account the chance that scores were guessed[42] or the difference between more disparate scores. The Spearman rho expresses correlation between values on a scale of –1 to 1, with no known standards for reliability. This correlation reveals only how much values vary in relationship to each other, not the degree of agreement between them, allowing it to discount systematic differences.[9] These values were not considered adequate to express reliability according to COSMIN standards. No transformation of reported values was required, except for simplifications detailed in the legend of Tables 1 and 2. No quantitative assessment of risk of bias across studies could be computed with the measures of reliability, and no additional quantitative analysis was performed.
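
To make the distinction between percentage agreement, unweighted kappa, and weighted kappa concrete, the following Python sketch computes all three for two raters. It is an illustration under stated assumptions: the function names and the two lists of composite scores are hypothetical and are not data from any included study, and the weighted version uses the standard disagreement-weight formulation with linear or quadratic weights.

```python
from collections import Counter


def percent_agreement(a, b):
    """Share of participants given the identical score by both raters."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b, categories, weights=None):
    """Cohen's kappa for two raters.

    weights=None gives the unweighted kappa; "linear" or "quadratic"
    gives a weighted kappa that penalizes larger disagreements more.
    Undefined (division by zero) if both raters use a single category.
    """
    n = len(a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    count_a, count_b = Counter(a), Counter(b)
    observed = sum(disagreement(index[x], index[y]) for x, y in zip(a, b)) / n
    expected = sum(
        disagreement(index[x], index[y]) * count_a[x] * count_b[y]
        for x in categories
        for y in categories
    ) / (n * n)
    return 1.0 - observed / expected


# Hypothetical composite Beighton scores (0-9) from two raters.
rater_a = [0, 1, 2, 4, 5, 6, 7, 8, 9, 3]
rater_b = [0, 2, 2, 4, 4, 6, 7, 9, 9, 3]
scores = list(range(10))

print(percent_agreement(rater_a, rater_b))
print(cohen_kappa(rater_a, rater_b, scores))
print(cohen_kappa(rater_a, rater_b, scores, weights="quadratic"))
```

In this made-up example every disagreement is off by a single point, so the quadratic-weighted kappa is higher than the unweighted kappa, which treats a 1-point and a 9-point disagreement identically.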

Table 1

Extracted Data

Population Description	Test Conditions	Whether Test Conditions Were Similar for the Measurements
Number of participants	Beighton score modifications	Participant sequence generation
Age	Examination setting	Whether sequence of participants was concealed
Sex	Number of raters	Blinding of raters
Diagnostic criteria	Rater professions	Key conclusions of study authors
Inclusion criteria	Experience	Statistical tests
Exclusion criteria	Training	COSMIN criteria
Time between measurements	Whether patients were stable in the interim

COSMIN, Consensus-Based Standards for the Selection of Health Measurement Instruments.

Table 2

Strength of Agreement for the Kappa Coefficient and Intraclass Correlations[14,39,40,5] [6]

Kappa Coefficient	Agreement	Intraclass Correlation	Reliability
≤0	Poor	<0.5	Poor
0.01-0.20	Slight	>0.5-0.75	Moderate
0.21-0.40	Fair	>0.75-0.9	Good
0.41-0.60	Moderate	>0.90	Excellent
0.61-0.80	Substantial
0.81-1.00	Almost perfect
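
The bands in Table 2 can be expressed as a simple lookup, sketched below in Python. The function names are hypothetical, and two boundary choices are assumptions rather than statements from the table: upper bounds are treated as inclusive, and an intraclass correlation below 0.5 is treated as poor.

```python
def kappa_agreement(kappa):
    """Map a kappa coefficient to the agreement label in Table 2."""
    if kappa <= 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


def icc_reliability(icc):
    """Map an intraclass correlation to the reliability label in Table 2."""
    if icc < 0.5:  # assumption: values below 0.5 are "poor"
        return "poor"
    if icc <= 0.75:
        return "moderate"
    if icc <= 0.90:
        return "good"
    return "excellent"


print(kappa_agreement(0.65))  # substantial
print(icc_reliability(0.91))  # excellent
```

Reported ranges that straddle two bands (for example, a kappa range of 0.71-1.0) would need both endpoints classified separately.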


Results

The database search strategy yielded 1250 records. Three articles not identified by these searches were discovered by literature citation and added to the screen. After the screening process delineated in Figure 1, a total of 24 records were determined to meet inclusion criteria.[‡]

Figure 1.

Flow diagram summarizing the literature search, screening, and review using the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines.

Table 3 includes characteristics of all included studies and their corresponding COSMIN criteria. All 24 studies selected for review were published in English and were observational studies with level 4 evidence. Of the 14 articles that explicitly express time intervals between measurements, the longest was 12 to 16 weeks,[5] with 12 of 14 reporting ≤2 weeks. A total of 1333 participants were examined for reliability of the Beighton score across included trials, with a reported mean ± SD age of 28.19 ± 17.34 years (range, 4-71 years). Of the 24 studies, 8 had populations <18 years old, and 14 included a higher proportion of women than men (640 female, 594 male, 273 unknown). Six studies included athletes in their participant population, and 8 comprised patients with pathological conditions. Seven studies used goniometers in their protocol.

Table 3

Population Characteristics, Time Interval, Study Design, and Associated COSMIN Scores

Study (Year)	Population Characteristics	Time Interval				Study Design
Study (Year)	Sample, Age (y), Female Sex (%), DOP^b,c	6.1^d	Interrater	Intrarater	6.2^d	Test Condition	No. of Raters	Rater Profession	Combined Rater Experience^c	Rater Training	6.3^d
Aartun (2014)[1]	111, 12-14, 46.8, middle school students	VG	<4 d	1-4 h	VG	5 item	2	Chiropractors	18 y	Standardization session	VG
Aslan (2006)[3]	72, 20.36 ± 1.24 (18-25), 40.20, undergraduate PT students	VG	<24 h	12.84 ± 7.41 d	VG	5 item + goniometer	2	PTs	21 y	2 h practice together	VG
Baumhauer (1995)[5]	21, 18-23, 57, intercollegiate athletes	VG	12-16 wk	NA	VG	5 item	2	NS	NS	NS	A
Boyle (2003)[8]	42, 25.4 ± 4.2 (15-45), 100, noninjured HS athletes and PT students	VG	15-60 min	6 ± 4 d	VG	5 item + goniometer	2	PTs	17 y	CME, trained with index	VG
Bulbena (1992)[11]	173, 43.98^e NS, JHS with >5 Beighton system	VG	Consecutive	NA	D	5 item	2	Rheumatologists	Experienced	NS	A
Cooper (2018)[15]	50, 49 (22-60), 56, community members	VG	NS	1 wk	VG	5 item + goniometer	1	NS	NS	NS	A
Erdogan (2012)[18]	15, 31.8 (16-50), 59.15, treated for ingrown nails	VG	NS	NS	D	5 item + goniometer	2	Rheumatologists	NS	NS	A
Erkula (2005)[20]	50, 10.4 ± 1.2 (8-15)^f 46.97, asymptomatic students	VG	2 wk	NA	VG	5 item	2	Orthopaedic surgeons	NS	NS	A
Evans (2012)[21]	30, 10.6 ± 2.3 (7-15), 65, asymptomatic podiatry clinic patients	VG	>2 h	>2 h	VG	5 item	2	Podiatrists	21 y	NS	A
Fritz (2005)[22]	38, 39.2 ± 11^f 57,^f history of lower back pain	VG	5 min	NA	VG	5 item	2	PTs	NS	NS	A
Glasoe (2002)[23]	30, 14-24, 100, athletes	VG	NS	NA	VG	5 item	2	NS	>6 y	NS	A
Hansen (2002)[27]	100, 9-13, NS, asymptomatic competitive athletes	VG	NS	NA	D	4/5, no fifth finger	4	2 rheumatologists 1 untrained physician	NS	Guided by illustrations	A
Hicks (2003)[29]	63, 36 (20-66), 60.30, patients with lower back pain	VG	>15 min	NA	VG	5 item	4	3 PT, 1 PT and chiropractor	20 y	Group review, 1 h practice	VG
Hirsch (2007)[31]	50, 38.3 ± 11.3 (20-60), 56, asymptomatic	VG	NS	24.6 d	VG	5 item + goniometer	2	Dentists	NS	Instructions, directed by orthopaedic surgeon	VG
Junge (2013)[34]	103, 7-8 and 10-12, 44^e healthy school children	VG	<30 min	NA	VG		4	PT students	NS	Trained	VG
Juul-Kristensen (2007)[35]	40, 42.27 (18-71)^e 68.33,^e BJHS, EDS, back/shoulder pain	VG	NS	NA	D	5 item	2	NS	NS	Trained per protocol	VG
Karim (2011)[37]	30, 24 (18-32), 100, contemporary professional dancers	VG	NS	NA	VG	5 item	4	1 PT, 3 PT students	30 y	PT trained students	VG
Naal (2014)[46]	55, 28.5 ± 4.1, 32.70, symptomatic FAI cases	VG	NS	NA	D	5 item	2	Clinicians	NS	NS	A
Pitetti (2015)[49]	25, 13.3 ± 2.9, 44, intellectually disabled	VG	3-4 wk	NA	VG	5 item + goniometer	2	DPT students	None	Peer supportive learning	VG
Smith (2012)[57]	5, 27, 100, patellar instability patients	VG	<1 d	30 min	VG	5 item	5	Orthopaedic surgeons	125 y	Familiarized	VG
Tarara (2014)[58]	19, 20.3 ± 1.2 (male), 19.8 ± 1.0 (female), 57.89, club athletes	VG	<2.5 h	4-7 d	VG	5 item	3	1 clinician and 2 novice students	22 y	Prior reading, 1 h training and questions	VG
Vaishya (2013)[62]	300, 24.6 ± 0.9, 36.67, postoperative ACL reconstruction and controls	VG	NS	NA	D	5 item	2	NS	NS	NS	A
Vallis (2015)[63]	36, 22.7 (18-32), 75, asymptomatic PT and OT students	VG	<1 d, 1 wk	NA	VG	5 item + goniometer	2	Researchers	NS	Teaching session	VG
van der Giessen (2001)[65]	48, 4-12, 48.9^f primary schoolchildren	VG	NS	NA	D	5 item	2	PT students	1 mo	Professional PT trained students	VG

ACL, anterior cruciate ligament; BJHS, benign joint hypermobility syndrome; CME, continuing medical education; DOP, description of participants; DPT, doctorate of physical therapy; EDS, Ehlers-Danlos syndrome; FAI, femoroacetabular impingement; HS, high school; JHS, joint hypermobility syndrome; NA, not available/applicable; NS, not specified; OT, occupational therapy; PT, physical therapy.

^b Age reported as mean ± SD or range.

^c Calculated.

^d COSMIN criterion (Consensus-Based Standards for the Selection of Health Measurement Instruments; see Appendix 2 for details). Scoring: VG = very good, A = adequate, D = doubtful, I = inadequate.

^e Weighted average of groups or 2-phase studies.

^f Demographics of larger sample, of which reliability population is a subgroup.

Raters in at least 18 of the 24 studies were health care professionals (HCPs) or HCP students. Eight studies had physical therapists or physical therapy students as raters; 2 studies, orthopaedic surgeons; 3 studies, rheumatologists; and in the 5 other studies, other HCP disciplines that were not specified in the article. One study referred to its raters as “researchers.” None of the studies included HCPs with equal years of experience. Half of the studies did not report the HCPs’ years of experience at all. For the studies that did report years of experience, the numbers for each HCP were summed to reach combined total years for Table 3. Table 4 includes measures of reliability in each study and the corresponding COSMIN criteria. Because the study designs, participants, interventions, and reported outcome measures varied markedly, results were synthesized in a qualitative manner, and pooled means could not be determined. Because 3 studies included reliability statistics for >1 cutoff score (ie, ≥4/9 and composite), the included 24 articles reported interrater reliability values for 27 cutoff scores.
For interrater reliability, 5 of the 27 scoring cutoffs were ≥4 of 9; 3 were ≥5 of 9; 13 were composite (total of 9 points); 4 used each item in the Beighton score; and 1 used a modified composite scale. Intrarater reliability was expressed for 10 total cutoff values: 3 were ≥4 of 9; 1 was ≥5 of 9; 5 included composite values; and 1 calculated a score for each item. Of the 8 studies that utilized intraclass correlation (ICC) to express interrater reliability, 1 found an excellent value; 4, good; 1, moderate to good; and 2, moderate. Of the 19 kappa values or ranges for interrater reliability, 3 were almost perfect; 6 were substantial; 2 were moderate; 1 was poor; and the others ranged between scales. Of the 7 ranges, 3 crossed between substantial and almost perfect, while the other 4 varied among lower ratings. Three studies used percentage agreement values, and 3 studies used the Spearman rho to demonstrate interrater reliability. For interrater reliability, 5 of 8 (62.5%) ICCs and 12 of 19 (63.2%) kappa values were better than moderate. Of the 13 intrarater values provided, 3 were ICC; 7 were kappa; 2 were percentage agreement; and 1 was a Spearman rho. All 3 ICC values for intrarater reliability were excellent. For the 7 kappa values and ranges, 2 were almost perfect; 2, substantial; 1, fair; and 2 had scores varying from substantial to almost perfect.
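
The difference between reporting a composite score and reporting a dichotomized cutoff can be illustrated with a short Python sketch. It assumes the conventional 9-point structure of the Beighton score (four maneuvers assessed on each side plus one trunk flexion maneuver); the item labels, function names, and example findings are hypothetical and chosen only for illustration.

```python
# Four maneuvers are scored on the left and right sides (1 point each),
# and forward trunk flexion contributes 1 point, for a maximum of 9.
BILATERAL_ITEMS = ("fifth_finger", "thumb", "elbow", "knee")


def composite_score(findings):
    """Sum positive maneuvers; `findings` maps item labels to booleans."""
    score = 0
    for item in BILATERAL_ITEMS:
        score += int(findings[f"{item}_left"]) + int(findings[f"{item}_right"])
    score += int(findings["trunk"])
    return score


def meets_cutoff(score, cutoff):
    """Dichotomize a composite score, e.g. cutoff=4 for '>=4 of 9'."""
    return score >= cutoff


# Hypothetical participant: positive fifth finger and thumb on both sides.
findings = {
    "fifth_finger_left": True, "fifth_finger_right": True,
    "thumb_left": True, "thumb_right": True,
    "elbow_left": False, "elbow_right": False,
    "knee_left": False, "knee_right": False,
    "trunk": False,
}
score = composite_score(findings)
print(score, meets_cutoff(score, 4), meets_cutoff(score, 5))
```

The same participant is classified differently under the ≥4 and ≥5 cutoffs, which is one reason reliability statistics for different cutoff scores are tabulated separately rather than pooled.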

Table 4

Inter- and Intrarater Reliability and Associated COSMIN Scores

		Reliability, Mean (95% CI)		COSMIN Item
Study (Year)	Cutoff Score	Interrater	Intrarater	6.4	6.5	6.6	6.7
Aartun (2014)[1]	≥4/9	κ = 0.65 (0.33 to 0.97)	κ = 0.66-1 (0.03 to 1)	NA	VG	D	A
Aartun (2014)[1]	≥5/9	κ = 0.56 (0.11 to 1.00)	κ = 1
Aslan (2006)[3]	Composite	ICC = 0.82; agreement = 42%	ICC = 0.92; agreement = 43%	A	NA	NA	NA
Baumhauer (1995)[5]	Composite	ρ = 1		NA	I	D	A
Boyle (2003)[8]	Composite	ρ = 0.87; agreement = 51%	ρ = 0.86; agreement = 69%	D	NA	NA	NA
Bulbena (1992)[11]	Each item	κ = 0.79-0.93		D	VG	D	NA
Cooper (2018)[15]	≥4/9	κ = 0.96^b (0.87 to 1.00)	κ = 1	NA	VG	D	A
Erdogan (2012)[18]	Each item	κ = 0.71-1.0	κ = 0.81-1.0	NA	VG	D	A
Erkula (2005)[20]		ρ = 0.86	ρ = 0.62	D	NA	NA	NA
Evans (2012)[21]	Composite	ICC = 0.73	ICC = 0.96-0.98	VG	NA	NA	NA
Fritz (2005)[22]	Composite	ICC = 0.72 (0.50 to 0.85)		VG	NA	NA	NA
Glasoe (2002)[23]	Composite	κ = 0.7		NA	VG	D	A
Hansen (2002)[27]	≥4/9	κ = 0.44-0.82		D	VG	D	A
Hicks (2003)[29]	Composite	ICC = 0.79 (0.68 to 0.87)		VG	NA	NA	NA
Hirsch (2007)[31]	≥4/9	ICC >0.84	ICC > 0.89	A	NA	NA	NA
Junge (2013)[34]	Each item ^c	κ = 0.49-0.94, 0.30-0.84		NA	VG	D	A
Junge (2013)[34]	≥5/9^c	κ = 0.64, 0.59^d
Juul-Kristensen (2007)[35]	Composite	ICC = 0.91		VG	VG	D	A
Juul-Kristensen (2007)[35]	≥5/9	κ = 0.66 (0.30 to 1.02), 0.74 (0.46 to 1.02)^d
Karim (2011)[37]	NS	κ = 0.6; agreement = 54%-100%		NA	VG	D	NA
Naal (2014)[46]	Composite	κ = 0.82^b (0.72 to 0.91)		NA	VG	VG	VG
Pitetti (2015)[49]	Composite	ICC = 0.88		A	VG	D	A
Pitetti (2015)[49]	Each item	κ = 0.45-0.80
Smith (2012)[57]	Composite	κ = 0.00 (−0.16 to 0.17)	κ = 0.25 (0.03 to 0.51)	NA	VG	VG	A
Tarara (2014)[58]	Modified composite^e	κ = 0.64-0.69^f; κ = 0.72^g (0.62 to 0.82)	Expert: κ = 0.69 (0.46 to 0.92); novice: κ = 0.72-0.73 ([0.53-0.90] to [0.58-0.89])	NA	VG	VG	A
Vaishya (2013)[62]	≥4/9	κ = 0.7		NA	VG	D	A
Vallis (2015)[63]	Composite	ICC = 0.72-0.80 ([0.51-0.84] to [0.64-0.89]); κ = 0.71-0.82 ([0.67-0.90] to [0.50-0.84])		A	VG	VG	A
van der Giessen (2001)[65]	Composite	κ = 0.81		NA	VG	D	A

A, adequate; COSMIN, Consensus-Based Standards for the Selection of Health Measurement Instruments; D, doubtful; I, inadequate; ICC, intraclass correlation; NA, not available/applicable; VG, very good.

^b Observer-participant reliability.

^c Percentage agreement omitted.

^d For 2 distinct methods of performing Beighton score.

^e Modified composite scale: 0 = pain with test, 1 = 8-9 points, 2 = 6-7 points, 3 = 4-5 points, 4 = 2-3 points, 5 = 0-1 points.

^f Expert-novice rater reliability.

^g Novice-novice rater reliability.

Out of the 168 COSMIN questions in the reliability domain across all studies, 79 (47%) were “very good”; 29 (17%), “adequate”; 24 (14%), “doubtful”; 1, “inadequate”; and 35 (21%) did not apply. Utilizing the COSMIN “worst score counts” principle, we determined that 1 (4%) study met “very good” criteria[29]; 7 (29%) met “adequate”[3,21,22,31,57,58,63]; 15 (63%) met “doubtful”[§]; and 1 (4%) met “inadequate”[5] for overall risk of bias in the reliability domain. Eight (33.33%) studies utilized ICC, and 16 (66.66%) comprised 19 kappa statistics to express interrater reliability, of which 4 (25%) used weighted kappa values. Of the 12 articles that included unweighted kappa values, 6 received an overall score of “doubtful,” which was attributed only to question 6.6, regarding use of weighted kappa,[44] when they otherwise would have received “adequate” or “very good” overall. Of the 24 included studies, 7 did not report an explicit time interval between reliability measurements. However, 6 of the 7 had another doubtful measure, which means that question 6.2, regarding the appropriateness of the time interval,[44] did not greatly affect the overall score for most studies.

Discussion

This systematic review has demonstrated high inter- and intrarater reliability for the Beighton score in individuals with and without hypermobility in a variety of clinical conditions. As demonstrated by the data derived from Table 3, varying time conditions, population characteristics, measurement tools, measurer education and training, and the Beighton score cutoff did not greatly influence the reliability of this test. Most studies demonstrated substantial to almost perfect interrater reliability values. Intrarater reliability was excellent or almost perfect in more than half of analyzed investigations. The quality of analyzed evidence was adequate, in contrast to findings in previous systematic reviews.[35]

The increased mobility seen in patients with an elevated Beighton score is of importance for the clinician. Generalized joint hypermobility is a risk factor for many musculoskeletal conditions, such as multidirectional shoulder instability,[54] hip instability,[12] femoroacetabular impingement,[46,64] hip dysplasia,[2,7] anterior cruciate ligament (ACL) injury,[60,62] flatfoot,[45] ankle sprains,[16] and many others. Clinicians should have a high index of suspicion for these conditions in this population.

Knowledge of hypermobility influences patient selection for surgical versus nonsurgical treatments, the actual surgical technique employed, and the expected prognosis and outcome with respect to risks for recurrence of symptoms (which may vary along a spectrum of instability).[19] This knowledge is important in the clinical setting because it helps practitioners avoid unnecessary imaging or interventions and the misdiagnosis of chronic pain.[66] Patients with hypermobility may warrant more aggressive rehabilitation or injury prevention protocols. Owing to the higher incidence of joint instability in patients with hypermobility, it has been suggested that these patients undergo prolonged strengthening, proprioception, and generalized conditioning programs when considering initial nonoperative treatment.[66]

Additionally, considerations in operative intervention may change with the knowledge of a patient’s hypermobility status. For instance, a surgeon might consider an open inferior capsular shift versus arthroscopic capsular plication for the hypermobile shoulder, or a surgeon may consider using a patellar tendon autograft over a hamstring tendon autograft in ACL reconstruction[38] to ensure greater stability postoperatively. Arthroscopic hip preservation surgeons may employ greater degrees of capsular plication and/or inferior capsular shift in patients undergoing surgical treatment for femoroacetabular impingement (FAI) syndrome and labral injury.[61] Even trauma and arthroplasty surgeons should consider a patient’s hypermobility status. Patients with hypermobility have been found to have lower bone density[25,47] than controls, which leaves them at greater risk for fixation failure, implant failure, and fracture. Postoperative protocols may need to be adjusted for this population to address the increased laxity. Thus, use of a reliable system, such as the Beighton score, for identifying these patients is essential to providing the most comprehensive musculoskeletal care.

Limitations of the present study include the quality of studies available in the literature, the failure of studies to include time intervals between intrarater measures, reporting bias, and lack of rater standardization or comparison. Studies that did not include time intervals between intrarater measures resulted in a summary COSMIN score of “doubtful.” Laxity may change in an individual over a period of decades[3,11,16,66]; however, it does not change over short periods. Thus, the omission of time intervals should not negatively affect a clinician’s evaluation of the evidence supporting inter- and intrarater reliability of the Beighton score. Additionally, score reporting is subject to publication bias and selective reporting because reliability may be reported by composite score, individual measurement score, or cutoff score. This flexibility may lead authors to choose the reporting measure with the most favorable results. Studies that measure interrater reliability risk underestimating it when raters are not properly standardized. Using raters with unequal experience may result in artificially low interrater statistics. All studies in the present review used raters with different levels of experience; thus, interrater reliability would likely be higher under standardized conditions. No single study utilized raters of different professions; therefore, the discrepancy in Beighton score reliability among health care disciplines cannot be evaluated by this study.
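The concern about selective reporting can be made concrete with a small sketch. The Python below computes unweighted Cohen's kappa for two raters, first on exact composite Beighton scores and then on the same scores dichotomized at a cutoff. Everything in it is an assumption for illustration: the ratings are invented and are not data from any reviewed study, the cutoff of 4 is only one of several thresholds in use, and the helper names `cohen_kappa` and `dichotomize` are hypothetical. The included studies may also have used weighted kappa or intraclass correlation coefficients rather than this unweighted form.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must score the same non-empty set of subjects")
    n = len(ratings_a)
    # Proportion of subjects on which the two raters agree exactly.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical composite Beighton scores (0-9) for 10 subjects; NOT study data.
rater_1 = [2, 5, 4, 7, 1, 6, 3, 8, 0, 5]
rater_2 = [3, 5, 5, 6, 1, 7, 2, 8, 1, 4]

CUTOFF = 4  # assumed threshold for "hypermobile"; cutoffs vary between studies


def dichotomize(scores, cutoff=CUTOFF):
    """Classify each composite score as hypermobile (True) or not (False)."""
    return [score >= cutoff for score in scores]


kappa_composite = cohen_kappa(rater_1, rater_2)
kappa_cutoff = cohen_kappa(dichotomize(rater_1), dichotomize(rater_2))

print(f"kappa on exact composite scores: {kappa_composite:.2f}")
print(f"kappa on cutoff classification:  {kappa_cutoff:.2f}")
```

With these invented ratings, the raters rarely assign the identical composite score but always agree on which side of the cutoff a subject falls, so the cutoff-based kappa is far higher than the composite-based one. The same pair of raters can therefore look weakly or perfectly reliable depending on which reporting measure is chosen.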

Conclusion

The Beighton score is a clinical assessment tool that shows acceptable inter- and intrarater reliability when used by raters of any background or experience level. Studies vary widely in participant population, study design, time interval, and rater experience yet consistently report substantial to excellent inter- and intrarater reliability. Although individual components of risk of bias also varied considerably among studies, most of the items were rated adequate to very good.

60 in total

1. Association of hypermobility and ingrown nails.

Authors: Fatma Gulru Erdogan; Abdurrahman Tufan; Munevver Guven; Berna Goker; Aysel Gurler
Journal: Clin Rheumatol Date: 2012-06-02 Impact factor: 2.980

2. Reliability of the Beighton Hypermobility Index to determinate the general joint laxity performed by dentists.

Authors: Christian Hirsch; Monique Hirsch; Mike T John; Jens Johannes Bock
Journal: J Orofac Orthop Date: 2007-09 Impact factor: 1.938

3. Reconstruction of the coracoclavicular and acromioclavicular ligaments with semitendinosus tendon graft: a pilot study.

Authors: Maristella F Saccomanno; Mario Fodale; Luigi Capasso; Gianpiero Cazzato; Giuseppe Milano
Journal: Joints Date: 2014-05-08

Review 4. Orthopaedic management of the Ehlers-Danlos syndromes.

Authors: William B Ericson; Roger Wolman
Journal: Am J Med Genet C Semin Med Genet Date: 2017-02-13 Impact factor: 3.908

5. Volumetric definition of shoulder range of motion and its correlation with clinical signs of shoulder hyperlaxity. A motion capture study.

Authors: Mickaël Ropars; Armel Cretual; Hervé Thomazeau; Rajiv Kaila; Isabelle Bonan
Journal: J Shoulder Elbow Surg Date: 2014-09-03 Impact factor: 3.019

6. CURRENT CONCEPTS IN THE TREATMENT OF GROSS PATELLOFEMORAL INSTABILITY.

Authors: Grant Buchanan; LeeAnne Torres; Brian Czarkowski; Charles E Giangarra
Journal: Int J Sports Phys Ther Date: 2016-12

7. Hypermobility syndrome increases the risk for low bone mass.

Authors: Selmin Gulbahar; Ebru Sahin; Meltem Baydar; Ciğdem Bircan; Ramazan Kizil; Metin Manisali; Elif Akalin; Ozlen Peker
Journal: Clin Rheumatol Date: 2005-11-26 Impact factor: 2.980

8. Test-retest reliability of ankle injury risk factors.

Authors: J F Baumhauer; D M Alosa; A F Renström; S Trevino; B Beynnon
Journal: Am J Sports Med Date: 1995 Sep-Oct Impact factor: 6.202

9. Inter-tester reproducibility and inter-method agreement of two variations of the Beighton test for determining Generalised Joint Hypermobility in primary school children.

Authors: Tina Junge; Eva Jespersen; Niels Wedderkopp; Birgit Juul-Kristensen
Journal: BMC Pediatr Date: 2013-12-21 Impact factor: 2.125

10. Interrater reliability: the kappa statistic.

Authors: Mary L McHugh
Journal: Biochem Med (Zagreb) Date: 2012 Impact factor: 2.313

2 in total

1. Capsule Closure of Periportal Capsulotomy for Hip Arthroscopy.

Authors: Rami George Alrabaa; Abhishek Kannan; Alan L Zhang
Journal: Arthrosc Tech Date: 2022-06-21

2. Assessment of systemic joint laxity in the clinical context: Relevance and replicability of the Beighton score in chronic fatigue.

Authors: Gabriella Bernhoff; Helena Huhmar; Lina Bunketorp Käll
Journal: J Back Musculoskelet Rehabil Date: 2022 Impact factor: 1.456
