Itzel Paola Melgoza1, Srish S Chenna1, Steven Tessier2, Yejia Zhang3, Simon Y Tang4, Takashi Ohnishi2,5, Emanuel José Novais2,6, Geoffrey J Kerr7, Sarthak Mohanty3, Vivian Tam8, Wilson C W Chan8,9, Chao-Ming Zhou10, Ying Zhang8, Victor Y Leung11, Angela K Brice3, Cheryle A Séguin7, Danny Chan8,9, Nam Vo10, Makarand V Risbud2, Chitra L Dahia1,12.
1. Orthopedic Soft Tissue Research Program, Hospital for Special Surgery, New York City, New York, USA.
2. Department of Orthopaedic Surgery, Sidney Kimmel Medical College, Thomas Jefferson University, Philadelphia, Pennsylvania, USA.
3. University of Pennsylvania, Philadelphia, Pennsylvania, USA.
4. Department of Orthopaedic Surgery, Washington University in St Louis, Missouri, USA.
5. Department of Orthopaedic Surgery, Faculty of Medicine and Graduate School of Medicine, Hokkaido University, Sapporo, Japan.
6. Lewis Katz School of Medicine at Temple University, Philadelphia, Pennsylvania, USA.
7. Department of Physiology & Pharmacology, Bone & Joint Institute, University of Western Ontario, London, Ontario, Canada.
8. School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong.
9. Department of Orthopaedic and Traumatology, The University of Hong Kong-Shenzhen Hospital, Shenzhen, Guangdong, China.
10. Department of Orthopaedic Surgery, University of Pittsburgh, Pennsylvania, USA.
11. Department of Orthopaedics and Traumatology, The University of Hong Kong, Pokfulam, Hong Kong.
12. Department of Cell & Developmental Biology, Weill Cornell Medicine Graduate School of Medical Sciences, New York City, New York, USA.
Abstract
Mice have been increasingly used as a preclinical model to elucidate mechanisms and test therapeutics for treating intervertebral disc degeneration (IDD). Several intervertebral disc (IVD) histological scoring systems have been proposed, but none reliably quantitates mouse disc pathologies. Here, we report a new robust quantitative mouse IVD histopathological scoring system developed by building consensus from the spine community's analyses of previous scoring systems and features noted in different mouse models of IDD. The new scoring system analyzes 14 key histopathological features from the nucleus pulposus (NP), annulus fibrosus (AF), endplate (EP), and AF/NP/EP interface regions. Each feature is categorized and scored; hence, the weight for quantifying the disc histopathology is equally distributed and not driven by only a few features. We tested the new histopathological scoring criteria using images of lumbar and coccygeal discs from different IDD models of both sexes, including genetic, needle-punctured, and static compressive models, and naturally aging mice spanning neonatal to old age stages. Moreover, disc sections from common histological preparation techniques and stains, including H&E, SafraninO/Fast green, and FAST, were analyzed to enable better cross-study comparisons. Fleiss's multi-rater agreement test shows significant agreement among multiple experienced and novice raters for all 14 features on several mouse models and sections prepared using various histological techniques. The sensitivity and specificity of the new scoring system were validated using artificial intelligence and supervised and unsupervised machine learning algorithms, including artificial neural networks, k-means clustering, and principal component analysis. Finally, we applied the new scoring system to established disc degeneration models and demonstrated high sensitivity and specificity of histopathological scoring changes.
Overall, the new histopathological scoring system offers the ability to quantify histological changes in mouse models of disc degeneration and regeneration with high sensitivity and specificity.
Histopathology evaluates cells, tissues, and organs at the microscopic level to better understand and diagnose a medical condition. Histopathological analysis is a crucial outcome measure for determining disease progression and the degenerative or regenerative state of tissues, such as the intervertebral disc (IVD), both clinically and in preclinical research. The IVD is a heterogeneous tissue forming a joint between each vertebra in the spine. Each IVD has three components: a central core of nucleus pulposus (NP), surrounded by orthogonal concentric layers of annulus fibrosus (AF), and connected to adjacent vertebrae by endplates (EP). Pathological degeneration of the IVD is a significant cause of chronic neck and lower back pain, a substantial socioeconomic burden that affects the quality of life of millions of people globally and has no effective disease‐modifying treatment.
Degeneration of the IVD is multi‐factorial, stemming from natural aging, injury, herniation, bulging, or fracture of lumbar vertebrae or facet joints, affecting its overall structure and function (reviewed in References 5, 6, 7). Histopathological evaluations are observational analyses that categorize samples based on features of cellular and structural changes. To quantify observational histopathological data, it is essential to: (a) establish criteria for categorizing the features of healthy IVD and those observed with its progressive pathologies that are recognizable and quantifiable; (b) harmonize terminology; (c) determine the ease of understanding the scoring criteria statistically by testing the agreement of scores from several randomly chosen independent observers on given samples; and (d) statistically evaluate the sensitivity of included features for quantifying IVD pathologies.
Preclinical animal models are valuable tools to study human diseases and test therapeutic interventions. In musculoskeletal research, including IVD and spine, several small and large preclinical animal models are employed based on each model system's advantages and the scope of the study. Because mice share high genetic similarity and notable anatomical and physiological similarities with humans, they have been widely used to study musculoskeletal disorders and other human diseases. The mouse model offers the advantage of precise and conditional genetic manipulation for mechanistic and functional studies to model IVD degeneration and back pain‐related conditions. Comparative studies have demonstrated that mouse lumbar IVDs deviate geometrically less from human IVDs than those of other preclinical animal models used for IVD research. Moreover, following geometric normalization, mouse IVDs were reported to be closer to humans with regard to torsion mechanics and collagen content. Additionally, the vertebrae of a few mouse strains, including Friend virus B (FVB), do not have a secondary center of ossification until skeletal maturity, or even until about 2 years of age (References 12, 13, 14, 15 and Figure 2E), and the EP is connected to the vertebral growth plate (GP).
With the widespread use of mice as a preclinical animal model to understand IVD pathologies (reviewed in References 16 and 17), it is crucial to establish an effective histopathological scoring system that can capture the key known features of human IVD pathologies found in various mouse IVD degeneration models, enabling better cross‐study comparisons. This study aims to develop a comprehensive mouse IVD histopathological scoring system that evaluates histopathology in all regions of mouse IVDs with high sensitivity and specificity to allow cross‐comparison between different mouse models of IVD degeneration and regeneration. We considered the strengths and weaknesses of previously reported scoring systems, incorporated feedback from multiple spine research groups, and captured features of human IVD pathologies that are observed in mouse IVDs. Consideration was also given to balancing the simplicity of scoring features, specificity, sensitivity, ease of adaptability to various mouse models of IVD degeneration, and higher inter‐rater and intra‐rater agreement. This article describes the development of a new mouse IVD histopathological scoring system, in which we (a) evaluate the IVD pathological features and develop new histopathological scoring criteria; (b) test the scoring criteria for agreement between raters; (c) validate the sensitivity and specificity of the scoring criteria using machine learning algorithms; and (d) apply the scoring criteria to various mouse models of IVD degeneration to analyze its adaptability (Figure 1).
FIGURE 1
Pipeline for development of the new mouse IVD histopathological scoring system. The workflow for development of “MERCY” (Mouse intErveRtebral disC histopathologY) included development of the new scoring system, testing reliability using multi‐raters, validation by applying AI and machine learning algorithms and application on established models of IVD degeneration for sensitivity and specificity
RESULTS
Development of a new mouse histopathological scoring system
To develop new mouse IVD histopathological scoring criteria, we first evaluated the pathologies described in the literature and gathered best practices from the spine research community.
Evaluation of normal mouse IVD and naturally occurring pathologies
First, we evaluated the naturally occurring age‐related pathologies in mouse IVDs. The classifications of normal postnatal growth (less than 3 months, 3 M), maturation (3‐6 M), middle age (10‐14 M), old (18‐24 M), and very old (>24 M) age are based on guidance from Jackson Laboratories for mice.
Naturally occurring pathologies in mouse IVDs are observed only after 16 to 18 M of age, and by about 24 M of age.
In summary, histology of a healthy IVD in neonatal and mature mice is characterized by evenly spread stellate or spindle‐shaped NP cells (Figure 2A,B). The AF lamellae in neonatal mouse IVDs are still developing (Figure 2A) but become organized into concentric layers by 1 month of age, at which time the EP has defined layers (Figure 2B). IVDs of skeletally mature mice (~3 M old) maintain normal histological features (Figure 2C). In the lumbar IVDs of middle‐aged mice (~12 M), the NP cells cluster together and may not be spindle‐shaped, the AF becomes thin, and its lamellae separate or show clefts, while the EPs may not change much (Figure 2D). The lumbar IVDs of old or very old mice have fewer NP and AF cells, which are isolated in lacunae with one or more nuclei. The AF loses its defined lamellar structure and protrudes inward toward the NP or outward. The AF of aged IVDs may lose its integration into the EP. The EP of aged IVDs may lose cells or show cells that are isolated in lacunae; the EP may have micro‐fissures or tears/fractures, and fibrosis may extend from the EP into the NP region. The boundaries between IVD regions may be lost, with visually evident loss of all IVD cell types or with few cells in lacunae in each region; the lamellar structure of the AF may be unrecognizable, and the EP may have several clefts and fissures (Figure 2E).
FIGURE 2
Natural growth and aging of mouse lumbar IVD. Representative H&E‐stained microscopic images of mouse lumbar IVDs at P7 (A), 1 M (B), 3 M (C), 12 M (D), and 28 M (E) of age prepared in the coronal plane. The black arrow in P7 IVD shows the immature cells in inner AF (A). The black arrow in 28 M IVD shows loss of demarcation between NP and AF and loss of AF integration into EP (E). AF, annulus fibrosus; EP, endplate; GP, growth plate; NP, nucleus pulposus. Scale bar = 200 μm
Review of the published mouse IVD histopathological scoring systems
Next, we reviewed the published IVD histopathological scoring systems, focusing on ones developed using rodent models or adopted to quantify pathologies in mouse IVDs. We short‐listed three IVD histopathological scoring systems developed using mouse models and adapted by studies using mouse models, and one developed in rat and adopted for scoring mouse IVDs. Two scoring systems were developed on human IVD samples (Figure 3) but later adapted for scoring mouse IVDs based on histopathological and microscopic features (References 34, 37, 38, 39, 40, 41, 42 to name a few; Table 1). The original Thompson grading system evaluates structural changes in human IVD at the macroscopic level and is not suitable for quantifying histological changes. Next, we compared these scoring systems for features analyzed, scoring range (Figure 3), experimental models, standard operating procedures (SOPs) for histological preparation of IVD samples, and statistical analysis for testing the reliability of the scoring system (Table 1). The needle‐puncture model was used for modeling IVD degeneration in all studies that developed mouse histopathological scoring systems.
IVDs of static compression models and genetic mutants were assessed by one study (Table 1). IVDs from aging rodents, both mice and rats, were generally not tested in the original studies, overlooking the naturally occurring pathologies; the Tam et al study did analyze IVDs from aged mice to develop its scoring criteria. While fibrosis in the NP region was considered by one rodent scoring system, NP and AF cellularity and matrix features were considered by all (Figure 3). However, none of the previous rodent IVD histopathological scoring systems analyzed notable pathological features of degenerating human IVDs, including the presence of cells in lacunae, protrusion of the AF, and vascularization of the AF, features also observed in IVDs of aging mice. The EP was not included in any of the previous rodent IVD histopathological scoring systems (Table 2). An EP grading schema was proposed in a recent study. The NP‐AF boundary was considered for scoring the interface region by a few studies (Figure 3). All histopathological scoring systems categorized the pathological features on an ordinal scale of equal intervals (Figure 3). All scoring systems, except Thompson, assigned zero (0) to healthy or non‐degenerate IVDs. The highest score given to the most severely degenerated IVDs varied between the studies, and so did the scoring range (Figure 3). The Tam et al study attributed the highest score of “four” based on the presence of NP mineralization as observed in the sacral IVD, which physiologically mineralizes and fuses before skeletal maturity and is not a degenerative phenotype. Hence, the severely degenerative phenotype in mouse IVDs cannot be scored accurately with that system. All studies tested their scoring systems using blinded raters for inter‐rater reliability (Table 1). Reliability was tested by applying different algorithms, including Fleiss's multi‐rater kappa (κ) for absolute agreement and weighted κ for testing the magnitude of agreement. Intra‐rater reliability was reported by only a few of the studies (Table 1).
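As an illustration of the agreement statistic named above, the following minimal Python sketch computes Fleiss's multi‐rater κ from a table of ordinal scores. It is not the code used by any of the cited studies; the function name `fleiss_kappa` and the example ratings are hypothetical and serve only to show how observed agreement is compared with chance agreement.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss's kappa for ratings[i][r] = score given to IVD i by rater r.

    Every IVD must be scored by the same number of raters, and every
    score must be one of `categories` (for example 0, 1, 2, 3).
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every IVD needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who gave IVD i the j-th category
    counts = []
    for r in ratings:
        tally = Counter(r)
        if set(tally) - set(categories):
            raise ValueError("rating outside the allowed categories")
        counts.append([tally[c] for c in categories])

    total = n_subjects * n_raters
    # Share of all assignments that fell into each category
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    # Per-IVD agreement: fraction of rater pairs that agree
    p_i = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects      # mean observed agreement
    p_e = sum(p * p for p in p_j)      # agreement expected by chance
    if p_e == 1.0:
        return float("nan")            # undefined when only one category is used
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical scores (0-3) from two raters for six IVDs
example = [[0, 0], [1, 1], [2, 3], [3, 3], [0, 1], [2, 2]]
print(round(fleiss_kappa(example, categories=[0, 1, 2, 3]), 3))
```

A value of 1 indicates complete agreement and a value near 0 indicates agreement no better than chance. Weighted κ, also mentioned above, additionally gives partial credit for near misses on an ordinal scale and is not covered by this sketch.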
FIGURE 3
Summary of published histopathological scoring systems. The chart shows features analyzed and scoring range from the listed histopathological scoring systems
TABLE 1
Summary of the previous IVD histopathological scoring systems utilized for grading mouse IVDs
Ohnishi et al, 2016
Tam et al, 2018
Tian et al, 2018
Han et al, 2008
Thompson et al, 1990
Boos et al, 2002
Species
Mouse
Mouse
Mouse
Rat
Human (adopted in mouse)
Human (adopted in mouse)
Sex
Male; Female
Male; Female
Female
Male
Male; Female
Not reported
Strain
C57BL/6J
129S9/SvEvH; C57BL/6J; ICR; F1 (C57BL/ 6 N CBA/Ca)
Quantitative data was presented as the means of three evaluations
Calculated average score of scorers
Calculated scores for NP and AF. Calculated average scores of 3 raters
Not reported
Grades of first replicate for each observer were averaged
Not reported
Inter‐rater reliability test
Kappa (algorithm not reported)
Fleiss' multi‐rater kappa
Weighted kappa
Cohen's kappa
Cohen's kappa
Weighted kappa
Results
κ =0.85‐1.0
NP structure: κ = 0.562
Kappa values not reported
Combined κ = 0.77
Combined κ = 0.67‐0.94
κ = 0.493‐0.977
Kappa for each feature not reported
Cleft/fissures in the NP: κ = 0.574
Kappa for each feature not reported
Agreement between assigned and average grades:
Kappa for each feature not reported
Cleft/fissures in AF: κ = 0.423
Grade 1: 85%
AF/NP boundary: κ = 0.203
Grade 2: 92%
AF structure: κ = 0.131
Grade 3: 68%
Grade 4: 90%
Grade 5: 76%
Intra‐rater reliability
Kappa
Not reported
Not reported
Cohen's kappa
Percent agreement: 85‐87%
Not reported
κ = 0.85–1.0
κ = 0.84
Cohen's kappa. κ = 0.87‐0.91
Examples of application for scoring mouse IVDs (select references)
27; 28; 29; 30
13; 20; 31
30; 34
33; 34
37; 38; 39; 40; 42
34; 41
TABLE 2
Fleiss's multi‐rater kappa (κ) to test inter‐rater reliability of trained and novice raters for the proposed 14 histopathological features
Each cell gives κ (95% CI lower bound, upper bound); P.

Experienced raters (208 IVDs, 2 raters), Set 1

| Features | Overall | Score‐0 | Score‐1 | Score‐2 | Score‐3 |
| --- | --- | --- | --- | --- | --- |
| NP Cellularity | 0.74 (0.65, 0.82); .00 | 0.73 (0.59, 0.87); .00 | 0.64 (0.51, 0.78); .00 | 0.54 (0.41, 0.68); .00 | 0.90 (0.76, 1.04); .00 |
| NP Fibrosis | 0.69 (0.60, 0.77); .00 | 0.58 (0.45, 0.72); .00 | 0.64 (0.50, 0.78); .00 | 0.42 (0.28, 0.55); .00 | 0.90 (0.76, 1.03); .00 |
| NP ECM | 0.61 (0.52, 0.69); .00 | 0.58 (0.44, 0.71); .00 | 0.35 (0.22, 0.49); .00 | 0.35 (0.21, 0.49); .00 | 0.90 (0.76, 1.04); .00 |
| AF Cellularity | 0.64 (0.55, 0.73); .00 | 0.78 (0.65, 0.92); .00 | 0.40 (0.26, 0.54); .00 | 0.53 (0.39, 0.66); .00 | 0.74 (0.60, 0.87); .00 |
| AF Bulging | 0.69 (0.60, 0.78); .00 | 0.88 (0.75, 1.02); .00 | 0.41 (0.28, 0.55); .00 | 0.53 (0.40, 0.67); .00 | 0.77 (0.64, 0.91); .00 |
| AF Lamellae | 0.68 (0.60, 0.76); .00 | 0.90 (0.77, 1.04); .00 | 0.63 (0.50, 0.77); .00 | 0.50 (0.36, 0.63); .00 | 0.53 (0.39, 0.67); .00 |
| AF Clefts/fissures | 0.73 (0.65, 0.82); .00 | 0.85 (0.71, 0.98); .00 | 0.61 (0.47, 0.74); .00 | 0.49 (0.35, 0.62); .00 | 0.89 (0.76, 1.03); .00 |
| EP Cellularity | 0.81 (0.69, 0.93); .00 | 0.91 (0.78, 1.05); .00 | 0.09 (−0.05, 0.23); .19 | 0.83 (0.70, 0.97); .00 | |
| EP Fissures | 0.67 (0.57, 0.77); .00 | 0.82 (0.68, 0.96); .00 | 0.45 (0.32, 0.59); .00 | 0.67 (0.53, 0.81); .00 | |
| Schmorl's node | 0.86 (0.72, 0.99); .00 | 0.85 (0.72, 0.99); .00 | | 0.85 (0.72, 0.99); .00 | |
| Interface Cellularity | 0.92 (0.79, 1.04); .00 | 0.95 (0.81, 1.08); .00 | 0.49 (0.35, 0.62); .00 | 0.94 (0.81, 1.08); .00 | |
| NP‐AF boundary | 0.90 (0.79, 1.01); .00 | 0.95 (0.82, 1.09); .00 | 0.70 (0.57, 0.84); .00 | 0.91 (0.78, 1.05); .00 | |
| NP‐EP boundary | 0.79 (0.68, 0.90); .00 | 0.83 (0.69, 0.96); .00 | 0.46 (0.33, 0.60); .00 | 0.85 (0.72, 0.99); .00 | |
| AF to EP disruption | 0.76 (0.66, 0.87); .00 | 0.93 (0.79, 1.06); .00 | 0.65 (0.51, 0.78); .00 | 0.62 (0.49, 0.76); .00 | |

Novice raters (208 IVDs, 2 raters), Set 1

| Features | Overall | Score‐0 | Score‐1 | Score‐2 | Score‐3 |
| --- | --- | --- | --- | --- | --- |
| NP Cellularity | 0.63 (0.54, 0.72); .00 | 0.68 (0.54, 0.81); .00 | 0.09 (−0.04, 0.23); .18 | 0.45 (0.31, 0.58); .00 | 0.88 (0.74, 1.02); .00 |
| NP Fibrosis | 0.56 (0.47, 0.65); .00 | 0.77 (0.64, 0.91); .00 | 0.24 (0.10, 0.37); .00 | 0.05 (−0.09, 0.18); .50 | 0.65 (0.52, 0.79); .00 |
| NP ECM | 0.51 (0.42, 0.60); .00 | 0.72 (0.59, 0.86); .00 | 0.16 (0.03, 0.30); .02 | 0.01 (−0.13, 0.15); .89 | 0.61 (0.48, 0.75); .00 |
| AF Cellularity | 0.43 (0.34, 0.52); .00 | 0.62 (0.49, 0.76); .00 | 0.00 (−0.14, 0.14); .99 | −0.01 (−0.14, 0.13); .93 | 0.55 (0.42, 0.69); .00 |
| AF Bulging | 0.46 (0.37, 0.55); .00 | 0.60 (0.47, 0.74); .00 | 0.02 (−0.12, 0.16); .76 | −0.01 (−0.14, 0.13); .94 | 0.68 (0.55, 0.82); .00 |
| AF Lamellae | 0.36 (0.28, 0.44); .00 | 0.63 (0.49, 0.77); .00 | 0.06 (−0.07, 0.20); .35 | 0.03 (−0.11, 0.16); .69 | 0.42 (0.29, 0.56); .00 |
| AF Clefts/fissures | 0.38 (0.29, 0.46); .00 | 0.63 (0.50, 0.77); .00 | 0.02 (−0.12, 0.16); .76 | 0.21 (0.08, 0.35); .00 | 0.41 (0.28, 0.55); .00 |
| EP Cellularity | 0.53 (0.42, 0.63); .00 | 0.64 (0.51, 0.78); .00 | 0.30 (0.16, 0.43); .00 | 0.53 (0.40, 0.67); .00 | |
| EP Fissures | 0.39 (0.29, 0.49); .00 | 0.56 (0.42, 0.69); .00 | 0.08 (−0.06, 0.21); .26 | 0.41 (0.28, 0.55); .00 | |
| Schmorl's node | 0.50 (0.36, 0.63); .00 | 0.50 (0.36, 0.63); .00 | | 0.50 (0.36, 0.63); .00 | |
| Interface cellularity | 0.65 (0.55, 0.75); .00 | 0.77 (0.63, 0.90); .00 | 0.27 (0.13, 0.41); .00 | 0.75 (0.61, 0.88); .00 | |
| NP‐AF boundary | 0.76 (0.66, 0.86); .00 | 0.85 (0.72, 0.99); .00 | 0.47 (0.34, 0.61); .00 | 0.84 (0.70, 0.97); .00 | |
| NP‐EP boundary | 0.57 (0.47, 0.68); .00 | 0.73 (0.60, 0.87); .00 | 0.09 (−0.05, 0.23); .19 | 0.67 (0.53, 0.80); .00 | |
| AF to EP disruption | 0.42 (0.32, 0.52); .00 | 0.59 (0.45, 0.72); .00 | 0.01 (−0.13, 0.15); .89 | 0.53 (0.40, 0.67); .00 | |

Novice raters (75 IVDs, 2 raters), Set 4

| Features | Overall | Score‐0 | Score‐1 | Score‐2 | Score‐3 |
| --- | --- | --- | --- | --- | --- |
| NP Cellularity | 0.74 (0.59, 0.89); .00 | 0.84 (0.61, 1.06); .00 | 0.37 (0.14, 0.60); .00 | 0.53 (0.30, 0.75); .00 | 0.92 (0.70, 1.15); .00 |
| NP Fibrosis | 0.70 (0.55, 0.86); .00 | 0.78 (0.55, 1.01); .00 | 0.37 (0.14, 0.60); .00 | 0.36 (0.13, 0.58); .00 | 0.92 (0.70, 1.15); .00 |
| NP ECM | 0.79 (0.64, 0.95); .00 | 0.86 (0.64, 1.09); .00 | 0.58 (0.35, 0.81); .00 | 0.63 (0.40, 0.86); .00 | 0.88 (0.66, 1.11); .00 |
| AF Cellularity | 0.75 (0.60, 0.90); .00 | 0.92 (0.69, 1.14); .00 | 0.33 (0.11, 0.56); .00 | 0.37 (0.14, 0.60); .00 | 0.92 (0.69, 1.15); .00 |
| AF Bulging | 0.72 (0.57, 0.88); .00 | 0.75 (0.53, 0.98); .00 | 0.54 (0.31, 0.76); .00 | 0.38 (0.15, 0.61); .00 | 0.92 (0.69, 1.14); .00 |
| AF Lamellae | 0.68 (0.52, 0.83); .00 | 0.85 (0.63, 1.08); .00 | 0.37 (0.14, 0.60); .00 | 0.12 (−0.11, 0.34); .31 | 0.82 (0.60, 1.05); .00 |
| AF Clefts/fissures | 0.67 (0.52, 0.81); .00 | 0.86 (0.64, 1.09); .00 | 0.46 (0.23, 0.68); .00 | 0.21 (−0.01, 0.44); .07 | 0.75 (0.52, 0.98); .00 |
| EP Cellularity | 0.85 (0.68, 1.02); .00 | 0.94 (0.72, 1.17); .00 | 0.74 (0.51, 0.96); .00 | 0.82 (0.60, 1.05); .00 | |
| EP Fissures | 0.74 (0.57, 0.91); .00 | 0.88 (0.66, 1.11); .00 | 0.62 (0.39, 0.84); .00 | 0.64 (0.41, 0.87); .00 | |
| Schmorl's node | 1.00 (0.77, 1.23); .00 | 1.00 (0.77, 1.23); .00 | | 1.00 (0.77, 1.23); .00 | |
| Interface Cellularity | 0.77 (0.60, 0.94); .00 | 0.89 (0.67, 1.12); .00 | 0.56 (0.34, 0.79); .00 | 0.78 (0.55, 1.01); .00 | |
| NP‐AF boundary | 0.78 (0.61, 0.95); .00 | 0.89 (0.66, 1.11); .00 | 0.59 (0.37, 0.82); .00 | 0.80 (0.57, 1.02); .00 | |
| NP‐EP boundary | 0.68 (0.50, 0.86); .00 | 0.78 (0.55, 1.00); .00 | 0.24 (0.02, 0.47); .04 | 0.76 (0.54, 0.99); .00 | |
| AF to EP disruption | 0.77 (0.60, 0.95); .00 | 0.88 (0.65, 1.10); .00 | 0.58 (0.35, 0.80); .00 | 0.77 (0.55, 1.00); .00 | |
Note: P value of less than .0001 is indicated as .00.
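The confidence bounds in Table 2 are approximately symmetric about κ, and some upper bounds exceed 1, which is what a normal approximation (κ ± 1.96 × SE) would produce. Assuming that approximation was used (the text does not state it), the short Python sketch below recovers the standard error and an approximate two‐sided P value from one reported row. The function name `kappa_normal_approx` is hypothetical.

```python
import math


def kappa_normal_approx(kappa, lower, upper, z_crit=1.96):
    """Back out SE, z, and a two-sided P value from kappa and its 95% CI.

    Assumes the interval was built as kappa +/- z_crit * SE, which is an
    assumption about how the table was generated, not a stated fact.
    """
    se = (upper - lower) / (2.0 * z_crit)          # half-width divided by z_crit
    z = kappa / se                                 # test of kappa = 0
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p_two_sided


# EP Cellularity, Score-1, experienced raters: kappa 0.09, CI (-0.05, 0.23), P .19
se, z, p = kappa_normal_approx(0.09, -0.05, 0.23)
print(f"SE={se:.3f}  z={z:.2f}  P={p:.2f}")
```

Because the tabulated values are rounded to two decimals, the recovered P value can only be expected to land near the reported one (about .2 here against a reported .19). The sketch also shows why an upper bound above 1 is an artifact of the symmetric interval rather than a κ that could actually be observed.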
Survey to capture feedback of spine community
Next, to capture the opinion of the spine community regarding histopathological features and scoring criteria for the mouse IVD, a detailed survey was designed. The survey was sent out through the ORS Spine Section to ~260 spine researchers and to an additional ~10 spine researchers. Forty‐two respondents representing 29 laboratories from around the world (Figure S1A) participated in the survey. However, one lab was over‐represented among the respondents (Figure S1A).
A multiple‐choice questionnaire captured the commonly used SOPs for histopathological preparation of mouse IVD samples. Results show that the lumbar (37.04%) and caudal (32.51%) IVDs are the most commonly studied spine regions (Figure 4A), processed either by paraffin embedding (49.09%) or by cryosectioning (36.36%) (Figure 4B), and sectioned at 5 to 20 μm thickness, mostly in the sagittal (45.59%) or coronal (38.24%) plane (Figure 4C). One of the respondents mentioned the use of custom 3‐D histology. Safranin‐O, Fast Green & hematoxylin (SafO/Fast green/H) (32.31%), and hematoxylin and eosin (H&E, 31%) were the commonly used histochemical stains (Figure 4D).
FIGURE 4
Survey results. Pie charts show the percentage response to each category of the multiple‐choice questionnaire related to the region of the spine (A), histological preparation (B), the plane of section (C), and histochemical stain (D) commonly used for mouse IVD research. Component band chart shows the percentage of responses to each category on a six‐point Likert scale for questions related to the importance of histological features for pathological grading of specific IVD regions (E). Histograms show percentage responses to multiple‐choice questions regarding specific features for scoring NP (F and G) and AF (H). Component band charts show the percentage response to close‐ended questions regarding various criteria (I) and scoring range (J) for development of the new scoring system. Histogram shows the percentage response to multiple‐choice questions regarding a future consensus study on methods for mouse IVDs (K). NR, not responded (E)
Based on the previous scoring systems for rodent and human IVDs (Figure 3), and pathologies reported in mouse IVDs, a list of scorable histopathological features was included in the survey. The percentage response on a six‐point Likert scale (0, least important; 5, most important) shows that features of NP morphology, cellularity, and fibrosis were considered important (Figure 4E). Moreover, clusters of NP cells (93%), absence/loss of NP cells (83%), number of NP cells (69%), and evenly spread NP cells (67%) were noted as critical features of NP morphology and cellularity (Figure 4F). Matrix disorganization (74%) and scar formation and tissue granulation (60%) were noted as key features of NP fibrosis (Figure 4G). Important scorable features of the AF included clefts/fissures, lamellar organization, and outward and inward bulging of the AF (Figure 4E). Inclusion of neovascularization of the AF in histopathological scoring was debated, as routine histopathological methods may be insufficient to visualize neovascularization, which instead requires specific staining and methodologies. Enthusiasm to score the inner and outer AF separately was noted (~60%, Figure 4H). The key features to consider for scoring the EP region included calcification, cartilage disorganization, fibrocartilage, Schmorl's nodes, microfractures/fissures, height/thickness, and the number of EP cells (Figure 4E). Regarding interface features, loss of demarcation between NP and AF, followed by disruption of AF lamellae into the EP and loss of the NP‐EP boundary, were considered important (Figure 4E).
Close‐ended questions regarding scoring criteria showed that most respondents preferred a separate score for each disc region (83%), generation of a cumulative score (71%), and comparison of specific levels of the IVD in the spine (76%) (Figure 4I). Inclusion of staining intensity in the histopathological score and scoring of each EP region were not preferred (Figure 4I). The scoring range for each IVD region received mixed responses: 0 to 5 (33.3%), 0 to 3 (31%), and 0 to 4 (23.8%) (Figure 4J).
Regarding additional outcome measures for future consensus methods for assessment of mouse IVDs, the highest enthusiasm was reported for assays of ECM content (64.3%), gene expression analysis (61.9%), and disc‐height index (52.4%) (Figure 4K).
List of histological features and scoring categories to quantify mouse IVD pathologies
New mouse IVD histopathological scoring criteria were developed, taking into consideration the naturally occurring mouse IVD pathologies, the previous scoring systems, and feedback received from the spine community. Histopathological features for scoring mouse IVDs were classified using a point‐based ordinal scale of equal intervals (0, 1, 2, and 3) to separately grade the NP, AF, EP, and interface regions (Figures 5, 6, 7, 8). The categories are linearly ordered: a score of 0 represents a normal structure, increasing scores represent increasing histopathology, and the highest score indicates severe degeneration. Following discussions on the list of identifiable features and the organization of features for each category, which together inform the linear order of degenerative changes, and after an initial test run (Data S1), it was decided that the NP and AF could be categorized on a 4‐point scale (0‐3), whereas the EP and interface could be categorized on a 3‐point scale (0‐2).
FIGURE 5
Histopathological scoring of mouse NP region. List of features, detailed criteria specific to each scoring category (0, 1, 2, and 3), and two representative images specific for each category (A‐D′) for histopathological scoring of NP region of the mouse IVD. H&E‐stained images in the coronal plane. Scale bar = 100 μm
FIGURE 6
Histopathological scoring of mouse AF region. List of features, detailed criteria specific to each scoring category (0, 1, 2, and 3), and two representative images specific for each category (A‐D′) for histopathological scoring of AF region of the mouse IVD. H&E‐stained images in the coronal plane. Scale bar = 100 μm
FIGURE 7
Histopathological scoring of mouse EP region. List of features, detailed criteria specific to each scoring category (0, 1, and 2), and two representative images specific for each category (A‐C) for histopathological scoring of EP region of mouse IVD. H&E‐stained images in the coronal plane. Scale bar = 100 μm
FIGURE 8
Histopathological scoring of mouse IVD interface region. List of features, detailed criteria specific to each scoring category (0, 1, and 2), and two representative images specific for each category (A‐C′) for histopathological scoring of the interface region between each compartment of mouse IVD. H&E‐stained images in the coronal plane. Scale bar = 100 μm
Nucleus pulposus: Three critical features considered for scoring the NP region are cellularity and morphology, fibrosis, and matrix organization (Figure 5). Cellularity and morphology are scored on the shape, presence of lacunae, and relative quantity of the NP cells. The presence of fibrous lamellae between cells and in the NP space is used to score fibrosis. Matrix organization is scored considering consolidation into clumps and disorganization. NP tissue with features such as cell loss, fibrous lamellae, and matrix disorganization is considered severely degenerated.
Annulus fibrosus: Four crucial features considered for scoring the AF are cellularity, bulging, lamellar organization, and clefts/fissures (Figure 6). Histopathological scoring of the AF includes a change in cell shape progressing from inner to outer AF, protrusion or bulging of the AF both inwards and outwards, disorganization or loss of AF lamellae and structure, and the presence of clefts and fissures between AF lamellae. A higher score for each feature indicates progression towards degeneration. AF cells can be distinguished from NP cells at the boundary by their presence in lamellae, which are absent around NP cells.
Endplate: Three features for scoring the EP region are cellularity, fissures/microfractures, and the presence of Schmorl's nodes (Figure 7). Cellularity is scored based on whether EP cells lie in defined layers and not in lacunae. EPs that show increased cellular disorganization, with fissures, microfractures, and Schmorl's nodes, receive higher scores. Schmorl's nodes are scored as either absent (0) or present (2).
Interface: Features scored at the interface are cellularity, NP‐AF boundary, NP‐EP boundary, and AF lamellar disruption into the EP (Figure 8). Cellularity at the interface is scored on the presence of cells in their respective compartments or at the border, and in lacunae. IVDs that show undefined boundaries between compartments receive higher scores.
Guidance on scoring range and adaptation
Overall, 14 features were listed for histopathological scoring of the mouse IVDs. Based on discussions during the development of the new scoring criteria, and feedback received from the survey participants, it was agreed that some basic SOPs and controls should be considered during experimental design for histopathological analysis.
All features within each IVD region being analyzed should be scored.
Adding the scores for the features within a specific IVD region informs on the pathology of that region; the highest possible score is 9 for NP, 12 for AF, 6 for EP, and 8 for the interface (Figure 9A).
FIGURE 9
Scoring range and interpretation. Stacked histogram for the 14 features shows the scoring range and interpretation of the scoring category (normal, mild, moderate, and severe) for each IVD region (A) or the entire disc (B)
Total scores from each region can be combined to generate a cumulative score for the entire IVD, where the maximum total score of a severely degenerated IVD is 35 (Figure 9B). By adding scores from each IVD region and considering a range (mean score ± 30%), we propose scoring ranges to classify normal (0‐6), mild (7–13), moderate (14–25), and severe (26–35) IVD degeneration (Figure 9B).
IVDs between cohorts should be analyzed from the same spine level, and sections should be prepared using the same SOP (fixation, serial sections, plane, and thickness of section).
Only sections from the mid‐plane region should be analyzed for histopathological grading.
Slides from all biological replicates from each cohort should be stained at the same time.
Comparisons should be done using age‐matched littermate controls for genetic studies and surgical models.
A significantly higher histopathological score of IVDs belonging to the experimental cohort compared to littermate controls, or to a younger mouse IVD for studies of natural aging, should be used to quantify the degree of degeneration.
A significantly lower histopathology score for the IVDs belonging to the regenerative cohort compared to age‐matched littermate controls may inform on the extent of prevention of degeneration or regeneration.
All raters should be trained on the new scoring criteria and should reach substantial or almost perfect agreement (Fleiss's κ greater than 0.61) before proceeding to scoring of experimental samples.
At least two raters who are blinded to the experimental conditions should independently score each image. The average of the two raters' scores should be used for further analysis.
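The arithmetic behind the regional maxima and the proposed classification ranges can be sketched as follows. This is a minimal illustration, not the authors' software: the dictionary layout, the feature keys, and the function names are assumptions made for the example, and it handles integer grades from a single rater only (how fractional averages of two raters map onto the ranges is not specified in the text).

```python
# Grade range and features for each IVD region, as listed in the text.
# The short feature keys are hypothetical labels chosen for this sketch.
REGIONS = {
    "NP": {"max_grade": 3, "features": ("cellularity", "fibrosis", "matrix")},
    "AF": {"max_grade": 3, "features": ("cellularity", "bulging", "lamellae", "clefts")},
    "EP": {"max_grade": 2, "features": ("cellularity", "fissures", "schmorl")},
    "Interface": {"max_grade": 2, "features": ("cellularity", "np_af", "np_ep", "af_ep")},
}


def region_max(region):
    """Highest possible score for a region (number of features x top grade)."""
    spec = REGIONS[region]
    return spec["max_grade"] * len(spec["features"])


def classify(total):
    """Map an integer total score (0-35) to the ranges proposed in Figure 9B."""
    if not 0 <= total <= 35:
        raise ValueError("total score must lie between 0 and 35")
    if total <= 6:
        return "normal"
    if total <= 13:
        return "mild"
    if total <= 25:
        return "moderate"
    return "severe"


def score_disc(grades):
    """grades: {region: {feature: grade}}. Returns (region totals, total, class)."""
    totals = {}
    for region, spec in REGIONS.items():
        given = grades[region]
        # Every feature of a region must be scored.
        if set(given) != set(spec["features"]):
            raise ValueError(f"all features of {region} must be scored")
        for name, grade in given.items():
            if grade not in range(spec["max_grade"] + 1):
                raise ValueError(f"{region}/{name}: grade {grade} out of range")
        totals[region] = sum(given.values())
    # Schmorl's nodes are scored as absent (0) or present (2) only.
    if grades["EP"]["schmorl"] == 1:
        raise ValueError("Schmorl's node is scored 0 or 2")
    total = sum(totals.values())
    return totals, total, classify(total)


# Hypothetical grades for one disc, for illustration only.
example = {
    "NP": {"cellularity": 2, "fibrosis": 1, "matrix": 1},
    "AF": {"cellularity": 1, "bulging": 1, "lamellae": 1, "clefts": 0},
    "EP": {"cellularity": 1, "fissures": 0, "schmorl": 0},
    "Interface": {"cellularity": 1, "np_af": 1, "np_ep": 0, "af_ep": 0},
}
region_totals, total_score, label = score_disc(example)
print(region_totals, total_score, label)
```

The regional maxima computed this way reproduce the values in the text (9, 12, 6, and 8, summing to 35).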
Structure features and preparation artifacts not to be interpreted as IVD pathologies
When scoring mouse IVD histopathology, the normal structures and the artifacts of histological preparation and staining listed below should not be scored.
The AF layers continue to align and organize collagenous lamellae during early postnatal development, and cells appear rounder, as is evident in IVD tissues from mice at P7 (Figure 10A), ~1 M (Figure 10C), and ~2 M (Figure 10D). These normal structures should not be misinterpreted as the loss of AF lamellar organization or the loss of cellularity with round cells observed during pathology, when AF cells no longer align in layers and may reside in lacunae.
FIGURE 10
Normal features and technical artifacts for consideration. H&E stained images of mouse IVDs sectioned in coronal plane. The midline in the IVD formed during embryonic development and formation of axial skeleton is visible as a notch in the center of EP () in P7 (A) and 24 M (B) old mouse lumbar discs, which is a normal feature. The immature AF in neonatal mouse IVDs do not have fully organized layers () and cells appear rounder as shown in P7 (A), ∼1 M (C), and ∼2 M (D) old mouse lumbar discs, which is a normal feature. During neonatal development, the AF lamella continues to integrate into the EP as shown by in P7 (A) ∼1 M (C), and ∼2 M (D) old mouse lumbar discs. The separation of entire AF lamellae (↓, C and D) parallel to the adjacent lamella which is otherwise cellular could be due to technical artifacts and are not features of clefts and fissures. Cracks or tears in the EP () may occur due to technical artifacts and are not features of micro‐fracture or fissures (E). Mineralization of sacral disc during adolescence (∼1 M of age in mice) is normal part of spine development (F, ∼2 M old). Clumping of NP cells into central mass can occur due to improper fixation and embedding (G), and is not a feature of NP pathology. Scale bar = 100 μm
During early neonatal development, as the AF lamellae organize, they continue to integrate into the EP, as is evident in IVD tissues from mice at P7 (Figure 10A), ~1 M (Figure 10C), and ~2 M (Figure 10D), and this process continues until skeletal maturity. Hence, the lack of distinction between AF and EP in developing IVDs should not be mistaken for loss of demarcation/boundary due to IVD pathology.
The separation of entire AF lamellae (Figure 10C,D) parallel to the adjacent lamella, which is otherwise cellular, could be due to technical artifacts and should not be scored as clefts and fissures.
The midline of the disc forms at the site where the left and right sclerotomes merge during development of the axial skeleton, and it remains visible as a notch in the EP, as is evident in lumbar IVDs from mice at P7 (Figure 10A) and 24 M (Figure 10B). This notch‐like feature evident in the mid‐coronal sections is a normal feature and should not be considered as a Schmorl's node, fissure, or micro‐fracture in the EP.
Cracks or tears in the EP caused by histological artifacts are large and empty and should not be scored as micro‐fractures or fissures (Figure 10E). A Schmorl's node shows fibrous matrix infiltration from the NP region into the EP and extends to the vertebra GP.
When scoring IVD pathologies, raters should distinguish the sacral IVD from the other regions of the spine. Mineralization of the sacral IVD during adolescence (~1 M of age in mice) is a normal part of spine development (Figure 10F, ~2 M old). Such mineralization features and vascular invasion are not observed even until ~30 M of age in the IVDs from the other spine regions. Hence, sacral IVDs should not be included in comparisons while grading IVDs from the cervical, thoracic, lumbar, and coccygeal spine.
Clumping of NP cells into a central mass can occur due to improper fixation and embedding (Figure 10G) and should be carefully evaluated.
IVDs of the same spinal level within the same spine region should be compared in histological analysis.
Test‐run to check the reliability of scoring criteria for mouse IVD pathologies
Description of models utilized and raters
The 14 histopathological features and scoring criteria for quantitative evaluation of mouse IVD histopathology were tested using images of 214 individual mouse IVDs collected from seven different laboratories. Scoring was carried out using digital images and not on actual histological slides. The images represented various histological methods, mouse strains, ages, and IVD degeneration models (Figure 11). Moreover, the IVD images were captured at various magnifications, which also tested whether the sections needed to be analyzed under a microscope to observe the features described in the scoring method.
FIGURE 11
Samples employed for testing the scoring criteria. Cross‐tabulation results plotted as multi‐layered donut where each of the nine layers shows the frequency distribution of samples in each factor (or variable) used to test the new Mouse intErveRtebral disC histopathologY scoring criteria
Testing inter‐rater agreement for the histopathological scoring features
The 14 features were scored on 214 de‐identified IVD images by 12 blinded and independent raters with varying academic backgrounds and experience in evaluating mouse IVD pathologies, representing seven different labs (Figure 12A). Six images reported to have poor resolution were removed, so agreement results are based on scores of 208 de‐identified IVD images only. The histopathological scores were analyzed for agreement using Fleiss's multi‐rater kappa (κ) test for reliability. As most labs may use only two raters for histopathological scoring studies, we first tested the inter‐rater agreement between sets of two blinded independent raters who scored the same images. Scoring results from two experienced (or trained) raters from Lab‐A and two novice raters from Lab‐B were analyzed for agreement (Figure 12B, and Table 2). Results show substantial to almost perfect overall agreement (κ) by experienced raters (criteria per Reference 49). The novice raters had fair, moderate, and substantial overall κ between different categories. Detailed analysis of each scoring category (0‐2/3) showed substantial to almost perfect κ values for normal structure (category 0) and for the most degenerative category (3 for NP and AF, and 2 for EP and interface), irrespective of training, for all 14 features. Fair to moderate κ values were observed for the middle categories of mild to moderate IVD degeneration (1 and 2 for NP and AF, and 1 for EP and interface) (Figure 12B, and Table 2).
Next, the novice raters were trained by the faculty member, who reviewed each of the 14 features for all scoring grades and how to distinguish them using random images of mouse IVDs from normal and degeneration models. We then tested whether training could improve the inter‐rater agreement of novice raters to substantial or almost perfect agreement and, if so, how many rounds of training were required. Novice raters were trained on 75 de‐identified images. At the fourth round of scoring, the Fleiss's κ test showed a dramatic improvement, with substantial to almost perfect overall κ between the raters for all features, and for most features in the individual scoring categories (Figure 12B, and Table 2). A few features in scoring categories 1 and 2 continued to have only fair agreement, which might improve further with additional training.
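For readers who want to reproduce an overall agreement statistic of this kind on their own scores, the standard Fleiss's κ formula can be sketched as below. This is an illustration under stated assumptions, not the statistical package used in the study: the function names and the toy ratings are hypothetical, and the interpretation bands are the commonly used Landis and Koch thresholds, which match the terms "fair", "moderate", "substantial", and "almost perfect" used here (whether these are exactly the criteria of Reference 49 is assumed).

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Overall Fleiss's kappa.

    ratings: one list per image holding the grade given by each rater;
             every image must be graded by the same number of raters.
    categories: all possible grades, e.g. (0, 1, 2, 3) for NP and AF features.
    """
    n_images = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each image needs the same number (>= 2) of raters")
    counts = [Counter(r) for r in ratings]
    # Observed agreement per image, then averaged over images.
    p_obs = sum(
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ) / n_images
    # Chance agreement from the overall proportion of each grade.
    props = [sum(c[cat] for c in counts) / (n_images * n_raters) for cat in categories]
    p_exp = sum(p * p for p in props)
    if p_exp == 1:
        raise ValueError("kappa is undefined when only one grade was ever used")
    return (p_obs - p_exp) / (1 - p_exp)


def agreement_label(kappa):
    """Landis and Koch style bands (assumed to be the criteria cited in the text)."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Toy data: two raters grading four images (hypothetical values).
perfect = [[0, 0], [1, 1], [2, 2], [3, 3]]
chance_level = [[0, 0], [0, 1], [1, 1], [1, 0]]
print(fleiss_kappa(perfect, (0, 1, 2, 3)), fleiss_kappa(chance_level, (0, 1)))
```

Under these bands, the training threshold recommended above (κ greater than 0.61) corresponds to the lower edge of "substantial" agreement.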
FIGURE 12
Reliability test of the new Mouse intErveRtebral disC histopathologY scoring system. Cross‐tabulation results plotted as multi‐layered donut showing the frequency distribution of raters that tested the scoring criteria (A). The heat map shows the results of Fleiss's kappa (κ) test for inter‐rater (B and C) and intra‐rater (E) reliability. A stacked bar chart shows the relative percentage of higher κ scores for histopathological features between each set of comparison including H&E, SafraninO/Fast green and hematoxylin (SafO), and FAST stained IVDs images (D)
Effect of histochemical stains on inter‐rater agreement
As the survey showed mixed responses on the choice of histochemical stain, we next compared the reliability of the 14 features using mouse IVD images prepared with three different histological staining techniques. Inter‐rater agreement was calculated using Fleiss's κ, and overall agreement was analyzed (Figure 12C, and Table 3). First, κ was calculated between all raters, experienced raters, and novice raters who scored the same 208 images of mouse IVDs. One experienced rater did not score six images due to a conflict; hence, the number of images was reduced from 208 to 202 for analysis in the all‐raters and experienced‐rater categories. Next, we tested the reliability of the features by histochemical stain and compared data from images of mouse IVD sections stained with H&E (41 images), SafO/Fast green/H (44 images), and FAST (36 images) (Figure 12C, and Table 3). Comparing the multi‐rater κ values for the 14 histopathological features between stains, agreement was higher for H&E‐stained images than for SafO/Fast green/H‐stained images scored by the same raters for 12 out of 14 features (85.7%) by all raters, 9 out of 14 features (64.3%) by experienced raters, and 11 out of 14 features (78.6%) by novice raters (Figure 12D, Table 3). Agreement for images stained with SafO/Fast green/H was higher than for FAST‐stained sections scored by the same raters for 10 out of 14 features (71.4%) by all raters, 10 out of 14 features (71.4%) by experienced raters, and 8 out of 14 features (57.1%) by novice raters (Figure 12D, Table 3). Agreement with FAST‐stained images was higher than with H&E for only one out of 14 features (7.14%), and only in the experienced raters' category.
TABLE 3
Fleiss's multi‐rater kappa (κ) to test the effect of histochemical staining on reliability of the proposed 14 histopathological features
All raters. Groups: All stains (208 IVDs, 10 raters); H&E (41 IVDs, 12 raters); SafO/FG/H (44 IVDs, 11 raters); FAST (36 IVDs, 12 raters). For each group, the columns give the overall κ, the lower (LB) and upper (UB) bounds of the 95% CI, and the P value.
| Features | All stains κ | LB | UB | P | H&E κ | LB | UB | P | SafO/FG/H κ | LB | UB | P | FAST κ | LB | UB | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NP Cellularity | 0.54 | 0.53 | 0.56 | .00 | 0.53 | 0.51 | 0.56 | .00 | 0.46 | 0.44 | 0.49 | .00 | 0.40 | 0.37 | 0.42 | .00 |
| NP Fibrosis | 0.49 | 0.48 | 0.50 | .00 | 0.45 | 0.42 | 0.47 | .00 | 0.44 | 0.41 | 0.47 | .00 | 0.35 | 0.33 | 0.37 | .00 |
| NP ECM | 0.51 | 0.49 | 0.52 | .00 | 0.50 | 0.48 | 0.52 | .00 | 0.43 | 0.40 | 0.46 | .00 | 0.37 | 0.34 | 0.39 | .00 |
| AF Cellularity | 0.36 | 0.35 | 0.38 | .00 | 0.45 | 0.43 | 0.48 | .00 | 0.35 | 0.32 | 0.37 | .00 | 0.32 | 0.29 | 0.34 | .00 |
| AF Bulging | 0.28 | 0.27 | 0.30 | .00 | 0.44 | 0.41 | 0.46 | .00 | 0.22 | 0.20 | 0.25 | .00 | 0.29 | 0.27 | 0.31 | .00 |
| AF Lamellae | 0.31 | 0.30 | 0.32 | .00 | 0.40 | 0.37 | 0.42 | .00 | 0.24 | 0.21 | 0.26 | .00 | 0.30 | 0.28 | 0.32 | .00 |
| AF Clefts/fissures | 0.24 | 0.22 | 0.25 | .00 | 0.28 | 0.26 | 0.30 | .00 | 0.21 | 0.18 | 0.23 | .00 | 0.29 | 0.27 | 0.32 | .00 |
| EP Cellularity | 0.32 | 0.31 | 0.34 | .00 | 0.37 | 0.35 | 0.40 | .00 | 0.33 | 0.30 | 0.36 | .00 | 0.23 | 0.20 | 0.25 | .00 |
| EP Fissures | 0.18 | 0.17 | 0.20 | .00 | 0.30 | 0.28 | 0.33 | .00 | 0.11 | 0.08 | 0.14 | .00 | 0.16 | 0.13 | 0.19 | .00 |
| Schmorl's node | 0.18 | 0.16 | 0.20 | .00 | 0.61 | 0.57 | 0.65 | .00 | 0.14 | 0.09 | 0.18 | .00 | 0.00 | −0.05 | 0.04 | .93 |
| Interface Cellularity | 0.46 | 0.44 | 0.47 | .00 | 0.39 | 0.36 | 0.41 | .00 | 0.40 | 0.37 | 0.43 | .00 | 0.29 | 0.26 | 0.32 | .00 |
| NP‐AF boundary | 0.61 | 0.59 | 0.62 | .00 | 0.58 | 0.55 | 0.61 | .00 | 0.54 | 0.51 | 0.56 | .00 | 0.42 | 0.39 | 0.44 | .00 |
| NP‐EP boundary | 0.50 | 0.48 | 0.51 | .00 | 0.35 | 0.32 | 0.38 | .00 | 0.56 | 0.53 | 0.59 | .00 | 0.34 | 0.31 | 0.37 | .00 |
| AF to EP disruption | 0.31 | 0.30 | 0.33 | .00 | 0.37 | 0.34 | 0.39 | .00 | 0.30 | 0.27 | 0.33 | .00 | 0.28 | 0.25 | 0.31 | .00 |
Note: P value of less than .0001 is indicated as .00.
Abbreviation: SafO/FG/H, Safranin‐O/Fast green and hematoxylin.
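The pairwise stain comparison reported for all raters can be checked directly from the overall κ values in Table 3. The short sketch below counts, for each pair of stains, the features with the higher κ; the data layout and the helper name `count_higher` are assumptions made for this example, while the κ values are copied from the "All raters" columns of Table 3.

```python
# Overall Fleiss's kappa per feature from Table 3 ("All raters"),
# in the order (H&E, SafO/FG/H, FAST).
KAPPA = {
    "NP Cellularity": (0.53, 0.46, 0.40),
    "NP Fibrosis": (0.45, 0.44, 0.35),
    "NP ECM": (0.50, 0.43, 0.37),
    "AF Cellularity": (0.45, 0.35, 0.32),
    "AF Bulging": (0.44, 0.22, 0.29),
    "AF Lamellae": (0.40, 0.24, 0.30),
    "AF Clefts/fissures": (0.28, 0.21, 0.29),
    "EP Cellularity": (0.37, 0.33, 0.23),
    "EP Fissures": (0.30, 0.11, 0.16),
    "Schmorl's node": (0.61, 0.14, 0.00),
    "Interface Cellularity": (0.39, 0.40, 0.29),
    "NP-AF boundary": (0.58, 0.54, 0.42),
    "NP-EP boundary": (0.35, 0.56, 0.34),
    "AF to EP disruption": (0.37, 0.30, 0.28),
}
HE, SAFO, FAST = 0, 1, 2  # column indices into each tuple


def count_higher(first, second):
    """Number of features whose kappa is higher for `first` than for `second`."""
    return sum(1 for values in KAPPA.values() if values[first] > values[second])


n_features = len(KAPPA)
he_over_safo = count_higher(HE, SAFO)
safo_over_fast = count_higher(SAFO, FAST)
print(f"H&E > SafO/FG/H: {he_over_safo}/{n_features} "
      f"({100 * he_over_safo / n_features:.1f}%)")
print(f"SafO/FG/H > FAST: {safo_over_fast}/{n_features} "
      f"({100 * safo_over_fast / n_features:.1f}%)")
```

The counts agree with the all‐raters proportions stated in the text (12 of 14 and 10 of 14 features).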
Magnitude of agreement between raters for the histopathological features
Additional algorithms were used to determine the magnitude of agreement between raters for observational data including Cohen's weighted kappa (κw) and the intra‐class correlation coefficient (ICC). These algorithms were employed in previous IVD histopathological reliability studies (Table 1). We compared the reliability of the 14 features listed in this study using Cohen's κw and ICC, allowing comparison of our scoring criteria with previous scoring methods (Table 1). The ICC results show excellent agreement for EP fractures and Schmorl's node and almost perfect agreement for all other 12 features (Table 4). The results of the Cohen's κw indicate excellent and substantial agreement for all 14 features (Table 4). Comparison of the results of the three reliability tests indicates that fair to moderate strength of agreement by Fleiss's κ is similar to excellent strength of agreement by ICC and Cohen's κw tests due to the difference in algorithms employed by each of these tests (Table 4).
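Unlike Fleiss's κ, which treats every disagreement alike, Cohen's weighted κ gives partial credit when two raters differ by only one grade on the ordinal scale, which is one reason the tests can report different strengths of agreement on the same scores. A minimal two‐rater sketch follows. The weighting scheme used in the study is not stated in this section, so the choice between linear and quadratic disagreement weights is exposed as a parameter and is an assumption of the example, as are the function name and the toy ratings.

```python
def cohen_weighted_kappa(rater_a, rater_b, n_categories, weights="linear"):
    """Cohen's weighted kappa for two raters using grades 0..n_categories-1.

    weights: "linear" or "quadratic" disagreement weights (assumed options;
             the source does not say which scheme was used).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty image set")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    power = 1 if weights == "linear" else 2
    n = len(rater_a)
    k = n_categories

    # Cross-tabulate the two raters' grades.
    table = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        table[a][b] += 1
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    observed = expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal, 1 at the extremes
            observed += w * table[i][j] / n
            expected += w * row_totals[i] * col_totals[j] / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single grade")
    return 1 - observed / expected


# Toy ratings (hypothetical): identical grading, and chance-level grading.
same = cohen_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
chance = cohen_weighted_kappa([0, 0, 1, 1], [0, 1, 0, 1], 2)
print(same, chance)
```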
TABLE 4
Testing the inter‐rater reliability of the proposed 14 histopathological features by intraclass correlation coefficient (ICC) and Cohen's weighted (κw)
Note: ICC was run using scores of all‐raters for all stains presented for Fleiss's κ in Table 3. Cohen's κw was run using scores of two experienced raters presented for Fleiss's κ in Table 2. P value of less than .0001 is indicated as .00.
Intra‐rater agreement test for reproducibility
Next, to determine the consistency in observations using the scoring criteria, intra‐rater reliability was tested for two blinded raters who scored the 14 features for the same 75 de‐identified IVD images. The strength of agreement was tested using Fleiss's κ, which shows substantial to almost perfect agreement for overall κ for the 14 features by each rater (Figure 12E, and Table 5), indicating that scoring using the new histopathological method is reproducible. Moreover, substantial to almost perfect agreement was observed for κ of each scoring category by both raters (Table 5).
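TABLE 5
Fleiss's kappa (κ) test of agreement for intra‐rater reliability for the 14 histopathological features
For the overall result and for each scoring category (Score‐0 to Score‐3), the columns give κ, the lower (LB) and upper (UB) bounds of the 95% CI, and the P value. Blank cells indicate values not reported.
| Rater | Features | Overall κ | LB | UB | P | Score‐0 κ | LB | UB | P | Score‐1 κ | LB | UB | P | Score‐2 κ | LB | UB | P | Score‐3 κ | LB | UB | P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rater 1 | NP Cellularity | 0.94 | 0.79 | 1.08 | .00 | 0.97 | 0.75 | 1.20 | .00 | 0.91 | 0.68 | 1.13 | .00 | 0.84 | 0.62 | 1.07 | .00 | 0.96 | 0.73 | 1.19 | .00 |
| Rater 1 | NP Fibrosis | 0.87 | 0.72 | 1.02 | .00 | 0.89 | 0.67 | 1.12 | .00 | 0.75 | 0.52 | 0.97 | .00 | 0.84 | 0.62 | 1.07 | .00 | 0.92 | 0.69 | 1.15 | .00 |
| Rater 1 | NP ECM | 0.91 | 0.76 | 1.06 | .00 | 0.95 | 0.72 | 1.17 | .00 | 0.84 | 0.62 | 1.07 | .00 | 0.86 | 0.63 | 1.09 | .00 | 0.92 | 0.69 | 1.15 | .00 |
| Rater 1 | AF Cellularity | 0.79 | 0.64 | 0.94 | .00 | 0.94 | 0.72 | 1.17 | .00 | 0.46 | 0.23 | 0.68 | .00 | 0.41 | 0.18 | 0.64 | .00 | 0.89 | 0.66 | 1.12 | .00 |
| Rater 1 | AF Bulging | 0.85 | 0.69 | 1.00 | .00 | 0.92 | 0.69 | 1.15 | .00 | 0.75 | 0.52 | 0.98 | .00 | 0.31 | 0.08 | 0.53 | .00 | 0.96 | 0.73 | 1.19 | .00 |
| Rater 1 | AF Lamellae | 0.86 | 0.71 | 1.01 | .00 | 0.89 | 0.65 | 1.11 | .00 | 0.69 | 0.46 | 0.91 | .00 | 0.84 | 0.62 | 1.07 | .00 | 0.91 | 0.69 | 1.14 | .00 |
| Rater 1 | AF Clefts/fissures | 0.91 | 0.77 | 1.06 | .00 | 0.92 | 0.69 | 1.15 | .00 | 0.80 | 0.58 | 1.03 | .00 | 1.00 | 0.77 | 1.23 | .00 | 0.96 | 0.73 | 1.18 | .00 |
| Rater 1 | EP Cellularity | 0.88 | 0.71 | 1.04 | .00 | 0.97 | 0.75 | 1.20 | .00 | 0.77 | 0.55 | 1.00 | .00 | 0.82 | 0.60 | 1.05 | .00 | | | | |
| Rater 1 | EP Fissures | 0.85 | 0.68 | 1.01 | .00 | 0.91 | 0.69 | 1.14 | .00 | 0.74 | 0.51 | 0.96 | .00 | 0.86 | 0.63 | 1.08 | .00 | | | | |
| Rater 1 | Schmorl's node | 1.00 | 0.77 | 1.23 | .00 | 1.00 | 0.77 | 1.23 | .00 | | | | | 1.00 | 0.77 | 1.23 | .00 | | | | |
| Rater 1 | Interface Cellularity | 0.86 | 0.70 | 1.03 | .00 | 0.95 | 0.72 | 1.17 | .00 | 0.72 | 0.50 | 0.95 | .00 | 0.86 | 0.64 | 1.09 | .00 | | | | |
| Rater 1 | NP‐AF boundary | 0.84 | 0.67 | 1.01 | .00 | 0.85 | 0.63 | 1.08 | .00 | 0.62 | 0.40 | 0.85 | .00 | 0.96 | 0.74 | 1.19 | .00 | | | | |
| Rater 1 | NP‐EP boundary | 0.90 | 0.71 | 1.10 | .00 | 0.93 | 0.71 | 1.16 | .00 | 0.55 | 0.32 | 0.78 | .00 | 0.96 | 0.74 | 1.19 | .00 | | | | |
| Rater 1 | AF to EP disruption | 0.85 | 0.68 | 1.03 | .00 | 0.90 | 0.68 | 1.13 | .00 | 0.63 | 0.40 | 0.86 | .00 | 0.91 | 0.69 | 1.14 | .00 | | | | |
| Rater 2 | NP Cellularity | 0.91 | 0.75 | 1.06 | .00 | 0.94 | 0.72 | 1.17 | .00 | 0.65 | 0.42 | 0.87 | .00 | 0.86 | 0.63 | 1.09 | .00 | 0.96 | 0.74 | 1.19 | .00 |
| Rater 2 | NP Fibrosis | 0.88 | 0.72 | 1.03 | .00 | 0.94 | 0.72 | 1.17 | .00 | 0.75 | 0.52 | 0.97 | .00 | 0.57 | 0.35 | 0.80 | .00 | 0.96 | 0.73 | 1.19 | .00 |
| Rater 2 | NP ECM | 0.93 | 0.78 | 1.08 | .00 | 0.97 | 0.75 | 1.20 | .00 | 0.82 | 0.59 | 1.05 | .00 | 0.86 | 0.63 | 1.09 | .00 | 0.96 | 0.73 | 1.19 | .00 |
| Rater 2 | AF Cellularity | 0.91 | 0.76 | 1.06 | .00 | 1.00 | 0.77 | 1.23 | .00 | 0.94 | 0.71 | 1.17 | .00 | 0.64 | 0.41 | 0.86 | .00 | 0.87 | 0.65 | 1.10 | .00 |
| Rater 2 | AF Bulging | 0.79 | 0.64 | 0.94 | .00 | 0.80 | 0.58 | 1.03 | .00 | 0.77 | 0.54 | 1.00 | .00 | 0.57 | 0.35 | 0.80 | .00 | 0.86 | 0.64 | 1.09 | .00 |
| Rater 2 | AF Lamellae | 0.83 | 0.68 | 0.98 | .00 | 0.89 | 0.66 | 1.11 | .00 | 0.77 | 0.54 | 1.00 | .00 | 0.47 | 0.25 | 0.70 | .00 | 0.91 | 0.69 | 1.14 | .00 |
| Rater 2 | AF Clefts/fissures | 0.93 | 0.78 | 1.08 | .00 | 0.97 | 0.75 | 1.20 | .00 | 0.92 | 0.69 | 1.14 | .00 | 0.80 | 0.57 | 1.03 | .00 | 0.96 | 0.73 | 1.19 | .00 |
| Rater 2 | EP Cellularity | 0.93 | 0.76 | 1.09 | .00 | 1.00 | 0.77 | 1.23 | .00 | 0.86 | 0.64 | 1.09 | .00 | 0.87 | 0.65 | 1.10 | .00 | | | | |
| Rater 2 | EP Fissures | 0.76 | 0.59 | 0.94 | .00 | 0.82 | 0.60 | 1.05 | .00 | 0.68 | 0.46 | 0.91 | .00 | 0.77 | 0.54 | 1.00 | .00 | | | | |
| Rater 2 | Schmorl's node | 1.00 | 0.77 | 1.23 | .00 | 1.00 | 0.77 | 1.23 | .00 | | | | .00 | 1.00 | 0.77 | 1.23 | .00 | | | | |
| Rater 2 | Interface Cellularity | 0.98 | 0.81 | 1.14 | .00 | 1.00 | 0.77 | 1.23 | .00 | 0.96 | 0.73 | 1.19 | .00 | 0.96 | 0.73 | 1.18 | .00 | | | | |
| Rater 2 | NP‐AF boundary | 0.83 | 0.67 | 1.00 | .00 | 0.94 | 0.72 | 1.17 | .00 | 0.77 | 0.55 | 1.00 | .00 | 0.72 | 0.49 | 0.95 | .00 | | | | |
| Rater 2 | NP‐EP boundary | 0.84 | 0.67 | 1.00 | .00 | 0.94 | 0.71 | 1.17 | .00 | 0.68 | 0.45 | 0.91 | .00 | 0.81 | 0.59 | 1.04 | .00 | | | | |
| Rater 2 | AF to EP disruption | 0.92 | 0.75 | 1.09 | .00 | 0.94 | 0.71 | 1.17 | .00 | 0.85 | 0.62 | 1.07 | .00 | 0.95 | 0.73 | 1.18 | .00 | | | | |
Note: P value of less than .0001 is indicated as .00.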
These analyses establish substantial agreement and reliability of the scoring criteria by trained and novice raters using several mouse models representing healthy and degenerated IVDs and from P7 to 28 M of age, while controlling for various factors including sex, age, mouse strain, and SOPs for histological preparation.
Validation of the sensitivity and specificity of the new IVD scoring system by applying machine learning approaches
Next, we validated the sensitivity and specificity of the new mouse IVD histopathological scoring system for predictive modeling using both unsupervised and supervised machine learning algorithms. To do so, scores for 14 features generated by 12 blinded raters for 214 IVD images were used.
Correlation of severity of histopathology based on scoring criteria
The heatmap shows the mean score by 12 raters for the 214 IVDs, arranged in columns in the same order for all 14 histopathological features, which are stacked in rows. A visual correlation between the scores of each feature in a given IVD is observed (Figure 13A,B). Schmorl's nodes were identified in only a few IVDs (Figure 13B). Pearson product moment correlation (r) analysis of the relationship between the 14 histopathological features shows a positive and statistically significant Pearson's coefficient between all the features of the NP, AF, and interface regions (r > .83, P < .000001 for all, Figure 13C, Table S1). While Pearson's coefficient between cellularity and clefts/fissures in the EP was high (r > .7, P < .000001), the correlation of the other 13 histopathological features with Schmorl's node was relatively weaker (r ~.36 to .5), but positive and significant (P < .000001). The lower r between the 13 histopathological features and EP Schmorl's node may be due to the rare occurrence of Schmorl's nodes in mouse IVDs from both the lumbar and coccygeal regions relative to the other features of IVD pathology. Overall, Pearson's coefficient r shows a strong and linear relationship between the 14 histopathological features and, as expected, is similar to that observed by ICC (Table 4).
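The pairwise coefficient behind a correlation matrix such as Figure 13C can be computed as sketched below. This is a generic illustration of Pearson's product moment correlation, not the authors' analysis pipeline; the helper names and the small set of mean scores are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of the same length (>= 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant feature")
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(features):
    """features: {feature name: list of mean scores, one per IVD}."""
    names = list(features)
    return {(a, b): pearson_r(features[a], features[b]) for a in names for b in names}


# Hypothetical mean scores for five discs, for illustration only.
mean_scores = {
    "NP cellularity": [0.0, 0.5, 1.5, 2.5, 3.0],
    "AF lamellae": [0.0, 0.5, 1.0, 2.5, 3.0],
    "Schmorl's node": [0.0, 0.0, 0.0, 0.0, 2.0],
}
matrix = correlation_matrix(mean_scores)
for (a, b), r in matrix.items():
    if a < b:
        print(f"{a} vs {b}: r = {r:.2f}")
```

In this toy data, as in the text, a rarely observed feature such as a Schmorl's node correlates more weakly with the other features than they do with each other.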
FIGURE 13
Predictive modeling and validation of the new mouse disc histopathological scoring. All charts presented in the sub‐figures are based on all graders' mean scores (n = 12) for all 14 features in 214 mouse IVD images. Each column in A and B represents an individual IVD, with the heat map showing the mean score by all raters (n = 12) for the listed 14 features (in rows) for all samples (214). Data in A and B are organized in the same order, so the mean score for each feature in a given IVD can be visually compared down the column. The data presented in A and B were used for analysis in all sub‐figures. C, Pearson correlation matrix for the listed histopathological scoring features (P < .001 for all). Unsupervised machine learning algorithm using k‐means clustering (D and E) and dispersion of samples based on the 14 histopathological features into PCs represented by PC scores determined by principal component analysis (PCA, F), and cross‐validation to class labels (models) and cluster membership. A supervised machine‐learning algorithm using an artificial neural network (ANN) with a multilayer perceptron (MLP) was trained on 70% of the data set and tested on the remaining 30%. Predicted probability (G) and area under the ROC curve (H) for the ANN MLP test
Validation of scoring criteria using unsupervised machine learning algorithms
We applied unsupervised machine learning using the k‐means clustering algorithm to test whether the 14 histopathological features (independent variables, mean score of ~12 raters) can partition the 214 IVDs into “k” clusters based on their similarities. The number of clusters (k = 4) was determined using TwoStep clustering, and the distance from the cluster center was measured as Euclidean distance. Next, with k set to four, k‐means clustering determined the final cluster membership of the 214 IVDs and the distance of each feature from the cluster center (Figure 13D, Table S2). The number of clusters and their membership were validated by supervised evaluation: the NP cellularity of the clusters was analyzed and compared with the class labels (the degeneration model to which each IVD belonged). The four clusters segregated by NP cellularity score and matched their respective models, as shown in Figure 13E. Controls and neonatal IVDs with normal histopathological features grouped in cluster 4. Aged and needle‐puncture IVDs grouped in cluster 1. IVDs from models of mild and moderate degeneration, including those from middle‐aged mice, grouped in clusters 3 and 2.
Next, using principal component analysis (PCA), a dimension‐reduction approach, we validated the 14 histopathological features for predicting IVD pathologies in the 214 IVD images. PCA extracted two principal components (PCs): PC1 (eigenvalue 11.64, 83.17% of variance) and PC2 (eigenvalue 0.84, 6.04% of variance). The PCs were validated using class labels (models) and cluster membership, which showed that the IVDs from the aged and needle‐puncture models in cluster 1 lay close together and furthest from the cluster 4 members, formed by the neonatal and control IVDs (Figure 13F).
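The clustering step can be illustrated with a minimal k‐means (Lloyd's algorithm) sketch using Euclidean distance. It does not reproduce the TwoStep procedure used to choose k, nor the software used in the study: the function name, the fixed starting centers, and the two‐feature toy data are assumptions made so that the example is small and deterministic.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Lloyd's k-means with Euclidean distance.

    points: list of feature vectors (e.g., mean scores per IVD).
    initial_centers: starting cluster centers, one per cluster (k = len).
    Returns (labels, centers), where labels[i] is the cluster index of points[i].
    """
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: memberships no longer change
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


# Hypothetical mean scores for two features in eight discs.
discs = [
    (0.0, 0.0), (0.2, 0.1),   # normal-looking
    (1.0, 1.0), (1.1, 0.9),   # mild
    (2.0, 2.0), (2.1, 1.9),   # moderate
    (3.0, 3.0), (2.9, 3.0),   # severe
]
labels, centers = kmeans(discs, [(0, 0), (1, 1), (2, 2), (3, 3)])
print(labels)
```

In practice, the cluster labels returned by such a procedure are arbitrary indices; as in the text, they only become interpretable after comparison with known class labels.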
Validation of scoring criteria using supervised machine learning algorithms
Supervised deep learning using an artificial neural network (ANN) with a multilayer perceptron (MLP) algorithm was applied to test whether a machine can be trained on the grading of the 14 listed histopathological features to correctly predict the health and degeneration classification of the mouse IVD developed in Figure 9B. The predicted pseudo‐probability chart shows that if the machine is trained using the scores of the 14 features to classify the IVDs into normal, mild, moderate, and severe degeneration, it can predict the classification (health and degeneration) of the IVDs from the testing dataset with high accuracy (Figure 13G, Table S3). The area under the receiver operating characteristic (ROC) curve was >0.99 for the categories, demonstrating a prediction of the IVD model with high sensitivity and specificity based on the histopathological features developed in the study (Figure 13H, Table S3). Cross validation using Spearman's rho (ρ 0.97, 95% CI 0.96 to 0.98, P < .00001) shows almost perfect correlation between the predicted and actual IVD health and degeneration classification based on the histopathological criteria. Apart from the validation, the ANN results using this limited dataset showed that a machine learning model can be developed using scores provided by human observers on the 14 features to predict the health and degeneration of the mouse IVD, paving the way to develop a more robust model in the future using a large dataset of scores or directly on images.
Using unsupervised and supervised machine learning algorithms, we show that the 14 histopathological features and scoring criteria developed in the study can predict the health and degeneration status of the mouse IVDs with high sensitivity and specificity.
Testing the applicability of the new scoring system using models of mouse IVD degeneration
Finally, we analyzed the applicability of the new mouse IVD histopathological scoring system using the images for three different mouse models of IVD degeneration that were part of the 214 IVD images used for testing (Figure 12) and validation (Figure 13).
Application to the tail needle‐puncture model using H&E‐stained images
Coccygeal IVDs of 3 M old male control, one‐day and four‐week post‐needle puncture were sectioned in the sagittal plane and stained with H&E (Figure 14A‐C′).
Two individual IVDs were scored per cohort (11‐12 raters) for the 14 histopathological features listed in the new scoring system. Mean scores plotted in the heat map show a progressive increase in histopathological scores with time following needle‐puncture (Figure 14D) where more dramatic changes were observed in the NP and AF regions.
FIGURE 14
Validation using H&E stained needle‐puncture model. The new histopathological scoring system and listed features were applied to quantify histopathological changes in H&E‐stained sagittal sections of coccygeal IVDs from the needle‐puncture model of ∼3 M old male mice (A‐C′). Scale bar = 200 μm. D, the heat map shows the mean score by 12 raters for the 14 histopathological features on six individual IVDs from the three cohorts. This data was used for analysis in sub‐figures E‐I. E, PC scores for 14 features based on the model determined by PCA analysis. F, k‐mean cluster membership and Euclidean distance from cluster center of the six IVD samples. G, ROC curve and area under the ROC (AUROC) curve with time following needle‐puncture compared to control cohort. Histogram for mixed‐model ANOVA and Tukey's multiple comparison test analyzing individual IVD region per cohort (H), and cumulative score for the entire IVD per cohort (I). Error bar in H and I shows mean ± SD. ns, not significant, * P < .05, ** P < .01, *** P < .001, and **** P < .0001
PCA was run and two components, PC1 (4.25 eigenvalue, 70.75% variance) and PC2 (1.04 eigenvalue, 17.37% variance), were extracted. PCA analysis shows that, based on the scores for six samples, features for specific IVD regions cluster together (Figure 14E). Analysis of the k‐means cluster membership and Euclidean distance data (from Figure 13D,E) for these six images shows that intact samples are members of cluster 4, which was formed by the neonates and controls, whereas the one‐week post‐injury samples were split between clusters 3 and 4 but were furthest away from the cluster center for cluster 4 (Figure 14F). The 4‐week post‐injury samples were members of cluster 1, formed by aged and other needle‐puncture models (Figure 14F). Next, the sensitivity and specificity of the 14 features in quantifying histopathological changes were tested by analyzing the area under the ROC curve (AUROC).
AUROC was high for both comparisons; for intact compared to 4‐week post‐injury it was 0.92 (0.8‐1 95% CI, P = .0002), and for intact compared to one‐day post‐injury it was 0.82 (0.65‐0.9 95% CI; P = .0038) (Figure 14G).
Next, we tested whether the new histopathological scoring system could quantify histological changes between the three cohorts using a mixed model ANOVA with Tukey's multiple comparisons test (Figure 14H). Data were analyzed in two ways: (a) using the sum score for each specific IVD region (NP, AF, EP, and interface) and comparing IVD regions between cohorts (Figure 14H); and (b) adding the sum scores for the IVD regions (same as all 14 features) to generate a cumulative score for the entire IVD and comparing results between cohorts (Figure 14I). Significant differences were detected between each IVD region of all cohorts by both methods. The EP was least affected by needle‐puncture and showed changes after 4 weeks only.
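As a rough illustration of what the AUROC values above measure, the following minimal sketch computes the area under the ROC curve as the probability that a score from the degenerated cohort exceeds a score from the control cohort (the Mann‐Whitney formulation, with ties counted as one half). The function name and the example scores are hypothetical; the study's ROC analysis was run in statistical software and its exact inputs are not reproduced here.

```python
def auroc(control_scores, degenerated_scores):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney U / (n1 * n2)).

    Each (degenerated, control) pair counts 1 if the degenerated score is
    higher, 0.5 if the two scores tie, and 0 otherwise.
    """
    wins = 0.0
    for d in degenerated_scores:
        for c in control_scores:
            if d > c:
                wins += 1.0
            elif d == c:
                wins += 0.5
    return wins / (len(control_scores) * len(degenerated_scores))


# Hypothetical feature scores (not data from the study).
control = [0, 0, 1, 0, 1, 0]
injured = [1, 2, 3, 2, 1, 3]
print(round(auroc(control, injured), 2))
```

An AUROC of 0.5 corresponds to no separation between cohorts and 1.0 to complete separation.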
Application to the static tail compression model using FAST stained IVD images
Coccygeal IVDs of 3‐5 M old male and female mice representing control, and from within the loop were sectioned in the sagittal plane and stained with FAST (Figure 15A‐C′). Two IVDs per cohort were scored (11‐12 raters) for the 14 histopathological features. The heat map shows the mean histopathological score of each feature for all six IVDs from the three cohorts (Figure 15D). Changes were observed in the AF and minor changes in the interface region. PCA analysis shows the dispersion of 14 features in PC1 (3.92 eigenvalue, 65.39% variance), and PC2 (1.61 eigenvalue, 26.9% variance) based on the scores for the six samples (Figure 15E). The k‐means cluster membership and Euclidean distance of these images show that control samples belong to cluster 4, formed by the neonates and controls. The IVDs that underwent static compression in the tail‐loop were members of cluster 3, formed by the group with mild degenerative changes, although the two cohorts were separated from the cluster center (Figure 15F). AUROC for the control compared to the early‐degeneration cohort was 0.86 (0.7‐1 95% CI; P = .001) and for control compared to the late‐degeneration cohort was 0.93 (0.84‐1 95% CI; P = .0001) (Figure 15G), both showing high sensitivity and specificity. Next, using a mixed model ANOVA and Tukey's multiple comparisons test, we quantified the changes in each region of the IVD between cohorts, and the overall changes in the IVDs of the three cohorts using the new histopathological scoring (Figure 15H). Similar to the pattern observed in the heat map, significant differences were observed in the AF and interface region when analyzed individually (Figure 15H). Analysis of the cumulative score for the entire IVD for each sample shows significant differences between cohorts (Figure 15I).
FIGURE 15
Validation using FAST stained tail loop model. Testing and validation of the new histopathological scoring system and listed features using FAST stained sagittal sections of coccygeal discs from the tail‐loop model of ∼3 to 5 M old male and female mice (A‐C′). Scale bar = 200 μm. D, the heat map shows the mean score by 12 raters for the 14 histopathological features on six individual IVDs from the three cohorts. This data was used for analysis in sub‐figures E‐I. E, PC scores for 14 features based on the model determined by PCA analysis. F, k‐mean cluster membership and Euclidean distance from cluster center of the six IVD samples. G, ROC curve and area under the ROC (AUROC) curve with time following tail loop compared to control cohort. Histogram for mixed‐model ANOVA and Tukey's multiple comparison test analyzing individual IVD region per cohort (H), and cumulative score for the entire IVD per cohort (I). Error bar in H and I shows mean ± SD. ns, not significant, * P < .05, ** P < .01, and **** P < .0001
Application to lumbar IVDs of a genetic model using SafO/Fast green stained images
Lumbar IVDs (L4‐L6) of 12 M old male and female mice from wild‐type control, and TonEBP heterozygotes (TonEBP/+) were sectioned in the coronal plane and stained with SafO/Fast green/H (Figure 16A‐C′). The IVDs of TonEBP/+ mice demonstrated varied pathological phenotypes and were grouped as TonEBP/+_1 and TonEBP/+_2.
Images of two IVDs per cohort were scored (11‐12 raters) for the 14 histopathological features listed in the new scoring system. The mean score of all raters for each feature is shown in the heat map comparing the three cohorts (Figure 16D). PCA analysis shows the dispersion of 14 features in PC1 (4.71 eigenvalue, 78.6% variance), and PC2 (0.96 eigenvalue, 15.95% variance) based on the scores for the six samples (Figure 16E). The k‐means cluster membership and Euclidean distance show that the two replicates from control correctly clustered together in cluster 4. One of the TonEBP/+_1 replicates was in cluster 3 (mild), and the other in cluster 2 (moderate IVD degeneration). Both replicates of the TonEBP/+_2 cohort were together in cluster 1, formed by severely degenerated IVDs (Figure 16F). AUROC for the control compared to TonEBP/+_1 was 0.83 (0.68 to 0.99 95% CI; P = .0026), and for control compared to TonEBP/+_2 was 0.93 (0.86 to 1 95% CI; P = .0001; Figure 16G), indicating high sensitivity and specificity. Next, using mixed model ANOVA and Tukey's multiple comparisons test, we quantified the changes in each region of the IVD between cohorts, and the overall changes in the IVDs of the three cohorts using the new histopathological scoring (Figure 16H). Significant differences were observed between the cohorts in each IVD region and when grouped together to generate a cumulative score.
FIGURE 16
Validation using SafO/Fast green stained genetic mouse model. Testing and validation of the new histopathological scoring system and listed features using SafraninO/Fast green and hematoxylin‐stained coronal sections of lumbar discs from the ∼12 M old TonEBP+/− and wild‐type control male and female mice (A‐C′). Scale bar = 200 μm. D, the heat map shows the mean score by 12 raters for the 14 histopathological features on six individual IVDs from the three cohorts. This data was used for analysis in sub‐figures E‐I. E, PC scores for 14 features based on the model determined by PCA analysis. F, k‐mean cluster membership and Euclidean distance from cluster center of the six IVD samples. G, ROC curve and area under the ROC (AUROC) curve for the two grades of degeneration compared to control cohort. Histogram for mixed‐model ANOVA and Tukey's multiple comparison test analyzing individual IVD region per cohort (H), and cumulative score for the entire IVD per cohort (I). Error bar in H and I shows mean ± SD. ns, not significant, ** P < .01, and **** P < .0001
Overall, application of the new histopathological scoring system to three different mouse models of IVD degeneration, for which histological samples were prepared using varied SOPs, showed that the features described in the new histopathological scoring system can distinguish even minor histopathological changes with high sensitivity and specificity.
DISCUSSION
Histopathological and structural changes in the IVD are crucial outcome measures due to their effect on IVD function. The mouse as a preclinical model to understand the structure‐function relationship of the IVD has gained importance primarily due to the relevant genetic and behavioral approaches for elucidating the mechanisms of IVD pathologies.
This study aimed to develop a comprehensive but easy‐to‐adapt mouse IVD histopathological scoring criterion, which captures the degenerated features noted in pathological human IVD tissues. This system enables better cross‐study comparison of mouse models, and is sensitive enough to quantify histopathology in mouse models of IVD degeneration and regeneration. We developed a list of 14 histopathological scoring features based on a literature review, previous IVD scoring systems, and a survey of the spine community, and tested them using several mouse models of IVD degeneration. Each scoring feature was categorized using a point‐based linear order of equal interval, enabling the analysis of specific features as the IVD progresses from normal to severe degeneration. This is one of the strengths of the new scoring criteria, as it enables equal distribution of weights across features for determining pathology of each IVD region, and the final score is not influenced by a few features listed only in the highest‐scoring category. Moreover, the new histopathological scoring criteria can quantify each region of the IVD separately, and these scores can be summed to generate a cumulative score to determine overall histopathological and structural changes in the IVD. As indicated by the survey respondents, we recommend that comparisons between cohorts be made on IVDs of the same level and from the same region of the spine.
One goal in developing a new IVD histopathological scoring system was its utilization for cross‐study comparisons of mouse IVD degeneration and regeneration models. Hence, we analyzed and captured phenotypic changes in multiple models representing mice from postnatal day seven (P7) to over two years of age, both sexes, commonly used genetic strains, and several different SOPs utilized for histological preparation (Figure 11).
We also used IVD images from these models to test the new histopathological scoring criteria using both trained and novice raters from various labs (Figure 12A). Using the large image set, the reliability of the 14 histopathological features was tested for overall agreement and agreement for each scoring category (Figure 12B,C, Tables 2, 3, 4, 5). Experienced raters demonstrated substantial to almost perfect overall agreement by Fleiss's multi‐rater κ, similar to results following four rounds of training by novice raters. Hence, we recommend that all raters, independent of previous experience, undergo training to become familiar with the scoring features before applying them in their experiments. Fair to moderate agreement was observed by both cohorts of raters for the middle categories, indicating that the observers have difficulty distinguishing subtle progressive histopathological changes. Although the survey results showed an inclination for a 0 to 5 scoring range (Figure 4J), our reliability analysis results showed that several categories should not be used for quantifying observational data. Analysis of ICC's magnitude of agreement and Cohen's κw tests showed excellent and almost perfect agreement. Intra‐rater reliability tested by Fleiss's κ showed substantial to almost perfect agreement overall and for each scoring category (Tables 2, 3, 4, 5), indicating the reproducibility of histopathological observations based on the listed features.
The survey results (Figure 4D) and discussions within the group highlighted that various histochemical stains are routinely applied in mouse IVD histology. Hence, the impact of histochemical staining on the visualization of histological features in the new scoring system was tested.
Relative comparisons of histopathological features by Fleiss's multi‐rater κ showed that IVD images stained with H&E had higher agreement by raters, independent of their training, for most features compared to Safranin‐O/Fast green and FAST stained IVD images. FAST stained images had only slight to fair agreement, which was lower than both H&E and Safranin‐O/Fast green stained images by all raters independent of their training. A caveat of this analysis was that the comparisons were not made on serial sections from the same samples, which will require further investigation. The model system and SOPs applied for preparation of FAST and Safranin‐O/Fast green were similar, as were those for the majority of H&E‐stained sections (Figure 12D), and hence are unlikely to affect the reliability tests. Moreover, the pathologist on this study commented that pathological examinations are routinely carried out on H&E‐stained sections.
The rigor of the new histopathological scoring system was verified by machine learning based statistical methods, which confirmed the 14 histopathological features' sensitivity and specificity for accurate prediction of mouse IVD degeneration. Moreover, the new scoring system's application on the established models of mouse IVD degeneration showed that the features could quantify the histopathological changes with high sensitivity compared to controls.
During the initial round of testing of the scoring system, we used a single total scoring method for each disc region, which resulted in fair reliability. Moreover, when we tested the new histopathological features by giving a single total score for each disc region and not for each feature within the region, it showed substantial agreement. However, it failed to predict the degeneration model (Figure S2) accurately. We therefore discourage users from using a total scoring method and instead recommend scoring each of the 14 features separately.
Moreover, cross validation by applying unsupervised machine learning algorithms to observational data with higher inter‐rater agreement (Figure S2) indicates the importance of rigorous statistical analysis for building scoring criteria.
Neovascularization of the AF was initially considered, but based on the response received from the survey (Figure 4E) and following discussion, there was a consensus that vascularization would be more viable. Additional and specific staining methods are required for visualization of vascularization. Following feedback from the pathologist on the test images, features related to vascularization of the AF were excluded from the scoring criteria.
While the survey respondents positively regarded disc aspect ratio analysis (Figure 4I), this morphometric change requires quantitative measurement and cannot be assessed by observation alone, so it was excluded from the features of the histopathological scoring. Disc aspect ratio and other morphometric parameters used to quantify structural changes in the IVD are nevertheless essential and may be analyzed as additional outcome measures as the study warrants.
There are a few limitations to the current study. While several histopathological features for scoring EP were discussed, due to strain‐related differences in the EP, only the features uniformly observed between various mouse strains were included for histopathological scoring. As Schmorl's nodes are not often observed in mouse IVDs, their absence (normal, 0) and presence (severe, 2) were scored on a binary scale consistent with the EP category's scoring range. The score for Schmorl's node should be carefully recorded. Further studies are required to test the new histopathological system on mouse models of either prevention of IVD degeneration or its regeneration. Based on the histological changes observed in one such study of IVD reactivation using the sacral disc as a model, we suggest the new histopathological scoring system can be adapted for quantifying histopathological changes associated with regeneration. Lastly, while the NP and AF were scored on a four‐point scale with the highest score of 3, the EP and interface were scored on a three‐point scale with the highest score being 2. During the initial test run, the consensus was that NP and AF are most affected by degeneration, and the features can be distinguished when detailed into four categories. However, it was not possible to do the same for the EP or interface. As discussed above, a more comprehensive scoring range results in inconsistent scores, affecting further analysis and reproducibility.
The limited data set in the study used for validation by machine learning algorithms supports the potential of the 14 histopathological features for building predictive models. However, robust machine learning approaches for modeling will require validation on a larger data set. Considering the sensitivity and specificity of the current scoring criteria by human raters in the current study, it may be feasible to test them directly on images in future studies. Importantly, with the advancement of new technologies, models, and our knowledge regarding IVD pathologies, it will be crucial to revise and update this scoring system.
Overall, using several mouse models of IVD degeneration from both sexes and all ages and controlling for the variability in the SOPs, we rigorously tested the reliability of all features and each scoring category using a large group of raters. Moreover, we tested the new histopathological scoring system for quantitative analysis using unsupervised and supervised machine learning algorithms and validated that the 14 histopathological features accurately predicted IVD degeneration with high sensitivity and specificity.
As the new histopathological scoring system captures several human IVD pathologies, it will help quantify preclinical mouse models and inform degenerative and regenerative approaches for translational research.
METHODOLOGY AND STATISTICAL ANALYSIS
Survey design and analysis
A detailed survey was designed specifically for IVDs of mouse models. The survey was deemed exempt research by the Corporal Michael J. Crescenz Veterans Affairs (VA) Medical Center Institutional Review Board (Protocol #01862). Multiple‐choice questions captured the commonly used SOPs for histopathological preparation, and the percent response to each category is presented as pie charts. A six‐point Likert scale from least important (scale 0) to most important (scale 5) captured responses regarding the importance of scoring features. The percent response for each point was plotted on a component band chart. Further consensus on parameters to consider while developing the new mouse IVD histopathological scoring system was gathered using closed‐ended questions, and the percent response to each category is plotted on a component band chart. Frequencies for response to each category were computed using SPSS 27 and data were plotted using GraphPad Prism 9.
Description of models utilized for testing the new scoring system
The frequency distribution of multiple variables for the 214‐individual images of mouse IVD used in the study was determined using cross‐tabulation in SPSS 27 (Table S1). In summary, lumbar (29.9%) and coccygeal (70.1%) discs from female (9.8%), male (15%), or mice of both sexes (75.2%) that belonged to C57BL/6J (49.5%), SM/J (21.5%), FVB (14.5%), B6 and DBA (4.7%) backgrounds were analyzed. The genetic background of 9.8% of mice was not reported (NR, Figure 11). The spines were processed either using 4% PFA and EDTA (93%) or Decalcifier I solution (7%) and embedded in paraffin and sectioned (85.5%) or cryosectioned (14.5%). The molarity of EDTA varied between labs that shared the images. Sections were prepared either in the coronal (75.7%) or sagittal (24.3%) plane. The age spanned from early postnatal (P7) to aged (24 M) and both male and female mice were analyzed. The neonatal (7.5%), natural aging (42.7%), needle‐puncture and matched controls (16.4%), tail‐loop and matched control (2.8%), and various genetic mutants including Sox9‐cKO and matched controls (9.3%), TonEBP+/− and matched controls (2.8%), Ercc1+/− and matched controls (7%), NODSCID and matched controls (3.3%), Asporin Tg and matched controls (2.8%), and Bailey and matched controls (0.9%) were analyzed. The sections were stained with FAST (16.8%), H&E (26.2%), Safranin‐O/Fast green, and hematoxylin (57%) (Figure 11). Using images of IVD from multiple biological variables and prepared using various SOPs helps in rigorous testing of the new histopathological scoring system, and its successful application to studies using mouse as a preclinical model system for IVD research.
Fleiss's multi‐rater kappa test for agreement
The histopathological scores for all 14 features were processed to analyze the reliability of the scoring features and criteria by determining the strength of agreement between raters using SPSS 27 and Fleiss's multi‐rater kappa (κ) reliability test. Fleiss's multi‐rater κ, recommended for testing the agreement between more than two raters for nominal, ordinal, and continuous data, tests κ for overall agreement and for agreement within each category of the observational data. Hence, Fleiss's κ also analyzes agreement between the raters for the middle categories of observational data. A κ of 0 indicates no agreement, and a κ of 1 indicates absolute agreement. Similar to other scale tests, there is no rule of thumb to categorize the κ value and interpret its magnitude or strength, as agreement for observational data may vary with the kind of study. Fleiss suggested guidance to carefully consider interpreting the strength of agreement for weighted κ (κw) and unweighted κ, where κ > 0.75 or so may indicate excellent overall agreement, and κ of less than 0.4 or so may indicate poor agreement.
Another guide used to interpret the magnitude of the agreement for observational data is that used by Landis and Koch in a study on the diagnosis of neurological conditions where they divided the κ into small categories.
Based on these categories κ < 0.2 may indicate slight, κ of 0.21 to 0.4 may indicate fair, κ of 0.41 to 0.6 may indicate moderate, κ of 0.61 to 0.8 may indicate substantial, and κ > 0.81 may indicate almost perfect agreement. A few raters indicated that the clarity of six images was low; hence, the reliability tests are based on 208 de‐identified images. Intra‐rater reliability was tested for two raters who scored 75 images for the 14 histopathological features. The tests were run to determine overall κ, κ for each scoring category, statistical significance, and 95% confidence interval (Tables 2, 3, and 5 and Table S5). In this study, we are following Landis and Koch's criteria to interpret the results.
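The overall Fleiss's κ and the interpretation bands described above can be sketched as follows. This is a minimal illustration that assumes every image is scored by the same number of raters; the function names and the toy ratings are hypothetical, and it does not reproduce the per‐category κ, significance tests, or confidence intervals that SPSS reports.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Overall Fleiss's kappa.

    ratings: one list per image, holding the category chosen by each rater.
    categories: all possible score categories, e.g. [0, 1, 2, 3].
    Undefined (division by zero) if every rating falls in a single category.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j]: number of raters who gave image i the category j
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement for each image, then averaged over images
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Chance agreement from the overall proportion of each category
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1.0 - p_e)


def landis_koch(kappa):
    """Map kappa to the agreement bands quoted in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy example: 4 images, 3 raters, scores on a 0 to 3 scale (not study data).
toy = [[0, 0, 0], [1, 1, 2], [3, 3, 3], [2, 2, 2]]
k = fleiss_kappa(toy, [0, 1, 2, 3])
print(round(k, 2), landis_koch(k))
```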
ICC and Cohen's kappa test for reliability
The scores were processed as mentioned above. Cohen's unweighted kappa (κ) is recommended for testing inter‐rater reliability for nominal data, while Cohen's κw may be used for ordinal data; it assesses reliability by assigning weights to the degree of disagreement between two raters. ICC measures the degree of correlation between measurements made by different raters, which may be used to interpret the reliability. The scores for the same 202 de‐identified images were analyzed by ICC for multi‐rater reliability using the data for all raters and all stains that were used for the Fleiss's multi‐rater κ (Table 3 and Figure 12C, all raters). Inter‐rater reliability for two raters, tested by Cohen's κw, used the same data from two experienced raters from Lab‐A (Table 2 and Figure 12B). SPSS 27 was used to determine the ICC coefficient and κw, statistical significance, and 95% confidence interval (Table 4). When calculating the ICC, the valid subjects (images) were reduced from 208 to 202 as one rater did not score six images due to conflict. Interpretation of κw and ICC is based on the Landis and Koch criteria described above.
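To make the idea of weighting disagreements concrete, here is a minimal sketch of Cohen's weighted κ for two raters scoring on an ordinal scale. The text does not state whether linear or quadratic weights were used, so the weighting scheme is an assumption exposed as a parameter, and the function name and example scores are hypothetical.

```python
def weighted_kappa(rater1, rater2, n_categories, weight="linear"):
    """Cohen's weighted kappa for two raters using integer scores 0..n_categories-1.

    Disagreement weights grow with the distance between the two scores:
    |i - j| / (k - 1) for "linear", and the square of that for "quadratic".
    """
    n = len(rater1)
    k = n_categories
    # Observed joint proportions of (rater1 score, rater2 score)
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater1, rater2):
        obs[a][b] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def w(i, j):
        d = abs(i - j) / (k - 1)
        return d if weight == "linear" else d * d

    observed = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(w(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    return 1.0 - observed / expected


# Hypothetical scores from two raters on a 0 to 3 scale (not study data).
r1 = [0, 1, 2, 3, 3, 2, 1, 0]
r2 = [0, 1, 2, 3, 2, 2, 1, 1]
print(round(weighted_kappa(r1, r2, 4), 2))
```

With either weighting, a one‐category disagreement is penalized less than a three‐category disagreement, which is why κw suits ordinal scores better than unweighted κ.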
Pearson product moment correlation coefficient
Pearson product moment correlation, also known as Pearson's correlation or r, was run on GraphPad Prism 9 to determine the relationship between the 14 histopathological features (variables) using a mean score from 12 blinded independent raters for 214‐IVD images. Pearson's r was computed for every pair of data sets. Significance was determined using a two‐tailed test of significance and 95% confidence interval (Table S2). Pearson's r can range from +1 (strongest positive association) to −1 (strongest negative association), and an r = 0 indicates that there is no association. For interpreting the positive correlation, similar to all scale values, correlation coefficients are thought to be difficult to categorize. In this study, we have adapted the systems where r is categorized; r ≤ .35 or so signifies a low or weak correlation, .36 to .67 or so signifies modest or moderate correlation, .68 to 1 signifies strong or high correlation, and r ≥ .9 very high correlation. The value of P was <.000001 for all comparisons and considered highly significant.
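The calculation and the interpretation bands described above can be sketched as follows. The function names and example values are hypothetical, and applying the bands to the absolute value of r is an assumption of this sketch, since the text describes them for positive correlations.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def describe_r(r):
    """Categorize r using the cut-offs quoted in the text (applied to |r|)."""
    a = abs(r)
    if a >= 0.9:
        return "very high"
    if a >= 0.68:
        return "strong"
    if a >= 0.36:
        return "moderate"
    return "weak"


# Hypothetical mean scores of two features across five IVDs (not study data).
feature_a = [0.1, 0.8, 1.5, 2.2, 2.9]
feature_b = [0.0, 1.0, 1.2, 2.5, 2.7]
r = pearson_r(feature_a, feature_b)
print(round(r, 2), describe_r(r))
```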
K‐means clustering
Unsupervised machine learning using the k‐means clustering algorithm was run on SPSS 27. As the scoring data are on the same scale and at an equal interval, they meet the assumptions of running k‐means clustering. Histopathological scores for 14 features from 12 blinded raters for 214 IVD images were processed. First, using TwoStep clustering, which generates a pre‐cluster of data into an automatically selected number of clusters, four clusters (k = 4) of fair quality were created. Euclidean distance was used to calculate the distance from the cluster center. The final cluster center for each of the 14 features, the distance between final clusters, and the number of cases in each cluster were analyzed (Table S3).
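For readers unfamiliar with the algorithm, a minimal sketch of k‐means with Euclidean distance is given below. It is not the SPSS implementation: k is taken as given rather than selected by TwoStep clustering, the initial centers are simply the first k points (an assumption made for reproducibility), and the data are hypothetical two‐feature points rather than 14‐feature score vectors.

```python
import math


def kmeans(points, k, n_iter=100):
    """Plain k-means (Lloyd's algorithm) with Euclidean distance.

    Returns the cluster label of each point, the final cluster centers, and
    each point's distance from its own cluster center.
    """
    centers = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # memberships stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    distances = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    return labels, centers, distances


# Hypothetical mean scores for two features in four IVDs (not study data).
points = [(0, 0), (3, 3), (0, 1), (3, 2)]
labels, centers, distances = kmeans(points, k=2)
print(labels, centers, distances)
```

The per‐point distance from the cluster center is the quantity the text uses to judge how typical a sample is of its cluster.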
Principal component analysis
Unsupervised machine learning using the PCA algorithm was run for dimension reduction and predictive modeling of the histopathological changes in mouse IVDs. Before running PCA, we tested whether our data met at least four assumptions required to run PCA: that the data (a) comprise multiple variables (n = 14 features) measured at equal intervals (0, 1, 2, and 3); (b) have a linear relationship (Figure 13C); (c) form a large data set (n = 214); and (d) do not include significant outliers. As the data met all four assumptions, PCA was run using GraphPad Prism 9 on the data for 14 variables (features) for 214 mouse IVDs from 12 blinded raters. Two principal components (PCs), PC1 and PC2, were selected based on the largest eigenvalues of 11.64 and 0.84, respectively. The percent variance of PC1 and PC2 was 83.17% and 6.04%, respectively. P < .05 was considered significant. Unsupervised validation of the 14 features in three different models of mouse IVD degeneration was performed by PCA (Figures 14C, 15C, and 16C).
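The link between the eigenvalues and the percent variance can be checked with a short calculation. The sketch below assumes that PCA was run on standardized variables, so that the total variance equals the number of variables (14); this is an assumption about the analysis settings, not a statement from the study. The function name percent_variance is hypothetical.

```python
def percent_variance(eigenvalues, n_variables):
    """Percent of total variance per component, assuming standardized variables."""
    # With standardized data each variable contributes a variance of 1,
    # so the total variance equals the number of variables.
    return [100 * value / n_variables for value in eigenvalues]


# Eigenvalues of PC1 and PC2 as reported (rounded to two decimals).
pc1, pc2 = percent_variance([11.64, 0.84], n_variables=14)
print(round(pc1, 2), round(pc2, 2))
```

Under this assumption the results are close to the reported 83.17% and 6.04%; small differences are to be expected because the eigenvalues are rounded.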
Artificial neural networks and multilayer perceptron
Supervised deep learning was applied for predictive modeling of mouse IVD histopathology using the ANN MLP algorithm, run using SPSS 27. The mean total histopathological score from 12 raters was used to interpret IVD health and degeneration based on the criteria proposed in Figure 9B and was used as the dependent variable for classification (normal, mild, moderate, and severe IVD degeneration). Histopathological scores for the 14 features from 12 blinded raters for 214 IVD images were used as covariates to determine their application for training the machine to correctly predict IVD health and degeneration. Age, model of IVD degeneration, sex, mouse strain, plane of section, region of spine, and histochemical stain were used as factors. The dataset was partitioned by randomly assigning cases (models and their associated factors and covariates), based on the relative number of cases, into a 70% training dataset and a 30% testing dataset. ROC was based on the pseudo‐probability. Details on network performance are provided in Table S4. To cross‐validate the ANN MLP predictions, the classifications (normal, mild, moderate, and severe IVD degeneration) of the 214 images were first ordered numerically. Next, Spearman's rank order correlation was run using SPSS to test the strength and magnitude of correlation between the actual and predicted classifications for the entire data set. Spearman's rho, statistical significance, and 95% confidence interval were determined.
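The cross‐validation step, in which the four classes are ordered numerically and the actual and predicted classes are compared by Spearman's rho, can be sketched as follows. This is an illustrative sketch and not the SPSS procedure. The numeric coding (normal = 0 to severe = 3), the function names, and the example classifications are assumptions made for illustration. Tied values receive the average of their ranks.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of positions pos+1 .. end+1
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho computed as Pearson's r on the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Assumed numeric ordering of the four classes.
GRADE = {"normal": 0, "mild": 1, "moderate": 2, "severe": 3}

# Made-up actual and predicted classes for six images.
actual = ["normal", "mild", "moderate", "severe", "mild", "severe"]
predicted = ["normal", "mild", "moderate", "severe", "moderate", "severe"]
rho = spearman_rho([GRADE[g] for g in actual], [GRADE[g] for g in predicted])
print(round(rho, 3))
```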
ROC curve
An ROC curve was applied to evaluate the performance of the new mouse IVD histopathological scoring system. ROC tests the sensitivity and specificity of a classification or scoring system. The ROC curve for the application testing dataset presented in Figures 14G, 15G, and 16G was run using GraphPad Prism 9 to test the performance of the 14 IVD histopathology‐scoring features. The sensitivity and specificity of the 14 features were tested by ROC curve using the Wilson and Brown method for computing the 95% confidence interval.
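The quantities summarized by an ROC curve can be sketched as below: sensitivity and specificity at one score threshold, and the area under the curve computed as the probability that a degenerated disc scores higher than a control disc. This is a generic illustration, not the GraphPad Prism analysis, and it does not reproduce the Wilson and Brown confidence interval. The labels (1 = degenerated, 0 = control), the scores, and the function names are assumptions.

```python
def sensitivity_specificity(scores, labels, threshold):
    """Sensitivity and specificity when a score >= threshold is called degenerated."""
    tp = sum(1 for s, lab in zip(scores, labels) if lab == 1 and s >= threshold)
    fn = sum(1 for s, lab in zip(scores, labels) if lab == 1 and s < threshold)
    tn = sum(1 for s, lab in zip(scores, labels) if lab == 0 and s < threshold)
    fp = sum(1 for s, lab in zip(scores, labels) if lab == 0 and s >= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both degenerated and control cases are required")
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons (ties count as 0.5)."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    if not positives or not negatives:
        raise ValueError("both degenerated and control cases are required")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Made-up total scores for three control and three degenerated discs.
total_scores = [2, 5, 12, 9, 20, 27]
is_degenerated = [0, 0, 0, 1, 1, 1]
sens, spec = sensitivity_specificity(total_scores, is_degenerated, threshold=10)
auc = roc_auc(total_scores, is_degenerated)
print(round(sens, 3), round(spec, 3), round(auc, 3))
```

Sweeping the threshold over all observed scores and plotting sensitivity against 1 − specificity gives the ROC curve itself.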
Mixed model ANOVA
The quantification of histological changes using the new mouse IVD histopathological scoring criteria for the 14 features was analyzed using mixed model ANOVA on GraphPad Prism 9 to compare differences between the three experimental cohorts and within‐group factors, including the four IVD regions (NP, AF, EP, and interface). A main effect was determined by Tukey's multiple comparisons test. P < .05 was considered significant.
AUTHOR CONTRIBUTION
Chitra L. Dahia, Makarand V. Risbud, Danny Chan, Cheryle A. Séguin, and Nam Vo conceptualized and designed the study. Literature review and comparisons of prior scoring systems were conducted by Itzel Paola Melgoza, Srish S. Chenna, and Chitra L. Dahia. Chitra L. Dahia designed the survey and analyzed the data. Chitra L. Dahia, Makarand V. Risbud, Danny Chan, Cheryle A. Séguin, Nam Vo, Simon Y. Tang, Yejia Zhang, Victor Y. Leung, Angela K. Brice, Itzel Paola Melgoza, Srish S. Chenna, and OT contributed to the development of the scoring criteria and the list of histopathological features. Angela K. Brice, the pathologist on the study, reviewed the histopathological scoring criteria and the language for pathological evaluation. Images for testing were contributed from the labs of Makarand V. Risbud, Chitra L. Dahia, Danny Chan, Nam Vo, Yejia Zhang, Simon Y. Tang, and Victor Y. Leung. Raters for testing the scoring criteria included Itzel Paola Melgoza, Srish S. Chenna, Steven Tessier, Yejia Zhang, Simon Y. Tang, Takashi Ohnishi, Geoffrey J. Kerr, Emanuel José Novais, Sarthak Mohanty, Vivian Tam, Wilson C. W. Chan, Chao‐Ming Zhou, Ying Zhang, and Chitra L. Dahia. Data processing, statistical analysis and predictive modeling, and figure preparation were done by Chitra L. Dahia. Inter‐rater and intra‐rater reliability analysis was conducted by Itzel Paola Melgoza, Srish S. Chenna, and Chitra L. Dahia. The article was drafted by Chitra L. Dahia, Itzel Paola Melgoza, Srish S. Chenna, and Sarthak Mohanty, and all authors read, edited, and approved the final article for submission.
CONFLICT OF INTEREST
The authors have no relevant conflict of interest to disclose in relation to this study.
Appendix
S1: Supporting Information