
A probable cis-acting genetic modifier of Huntington disease frequent in individuals with African ancestry.

Jessica Dawson¹, Fiona K Baine-Savanhu¹, Marc Ciosi², Alastair Maxwell², Darren G Monckton², Amanda Krause¹.

Abstract

Huntington disease (HD) is a dominantly inherited neurodegenerative disorder caused by the expansion of a polyglutamine-encoding CAG repeat in the huntingtin gene. Recently, it has been established that disease severity in HD is best predicted by the number of pure CAG repeats rather than total glutamines encoded. Along with uncovering DNA repair gene variants as trans-acting modifiers of HD severity, these data reveal somatic expansion of the CAG repeat as a key driver of HD onset. Using high-throughput DNA sequencing, we have determined the precise sequence and somatic expansion profiles of the HTT repeat tract of 68 HD-affected and 158 HD-unaffected African ancestry individuals. A high level of HTT repeat sequence diversity was observed, with three likely African-specific alleles identified. In the most common disease allele (30 out of 68), the typical proline-encoding CCGCCA sequence was absent. This CCGCCA-loss disease allele was associated with an earlier age of diagnosis of approximately 7.1 years and occurred exclusively on haplotype B2. Although somatic expansion was associated with an earlier age of diagnosis in the study overall, the CCGCCA-loss disease allele displayed reduced somatic expansion relative to the typical HTT expansions in blood DNA. We propose that the CCGCCA loss occurring on haplotype B2 is an African cis-acting modifier that appears to alter disease diagnosis of HD through a mechanism that is not driven by somatic expansion. The assessment of a group of individuals from an understudied population has highlighted population-specific differences that emphasize the importance of studying genetically diverse populations in the context of disease.


Keywords: African ancestry; CAG repeat; CCGCCA loss; Huntington disease; cis-acting modifier; genetically diverse

Year: 2022 PMID: 35935919 PMCID: PMC9352962 DOI: 10.1016/j.xhgg.2022.100130

Source DB: PubMed Journal: HGG Adv ISSN: 2666-2477

Introduction

Huntington disease (HD; MIM: 143100) is a dominantly inherited neurodegenerative disorder caused by expansion of the CAG repeat tract in exon 1 of the huntingtin (HTT; MIM: 613004) gene to 36 or more repeats. The inherited expanded CAG size inversely correlates with the age of onset (AoO): the longer the CAG repeat length inherited, the earlier the onset of HD symptoms. Expanded CAG repeat alleles are not only unstable in the germline, with a bias toward repeat length increases in successive generations, but are also unstable in somatic cells. Factors such as the length of the repeat, the age of the individual, and cell type affect the degree of somatic mosaicism observed.5, 6, 7, 8 The typical sequence of the HTT repeat tract is made up of a polyglutamine region (Q = glutamine) encoded by a number of pure CAG repeats in the first position (Q1 = CAG) and an intervening CAACAG sequence in the second position (Q2 = CAACAG). These are followed by the polyproline region (P = proline) encoded by the intervening CCGCCA sequence (P1 = CCGCCA), a stretch of pure CCG repeats (P2 = CCG), and lastly two downstream CCT repeats (P3 = CCT) (Figure 1). The HTT repeat tract has been reported to be a hotspot for variants8, 9, 10, 11 that alter the sequence relative to the reference. This may include duplication or loss of all or part of the intervening sequence between the pure CAG and CCG repeats (CAACAG-CCGCCA in typical alleles) and variation in the number of the downstream CCT repeats.8, 9, 10, 11, 12, 13

Figure 1

The HTT disease and non-disease allele structures in African ancestry individuals

Schematic representation of the HTT disease and non-disease allele structures defined for this study. The typical allele structures were grouped together as Q1-2-2-P2-2, while the atypical allele structures are shown individually for deviations from the reference allele structure to be clearly demonstrated.

In other repeat expansion disorders such as spinocerebellar ataxia type 1 (SCA1), myotonic dystrophy type 1 (DM1), and fragile X syndrome, repeat stability and disease severity are affected by mutations within the repeat tract (i.e., atypical allele structures).14, 15, 16, 17, 18 In SCA1 and fragile X syndrome, the stabilizing interruptions have been identified on unexpanded alleles, while they are absent or very rare on expanded alleles. Mutations interrupting the HTT repeat tract may have a similar effect for HD. A recent study of the HTT repeat tract in over 800 European individuals affected with HD revealed that atypical allele structures are more frequent than previously thought (∼8% of non-disease alleles and ∼3% of disease-associated alleles). Several studies have now shown that HD severity is best explained by the length of the pure CAG repeat tract (Q1) and not by the length of the polyglutamine tract encoded (Q1 + Q2). Given that somatic expansion of the CAG repeat is also best predicted by the number of pure CAG repeats, coupled with the observation that CAG length-independent variation in age at onset is associated with DNA repair gene variants, these data confirm that somatic expansion is a key driver of HD onset. Whether somatic expansion in African individuals affected with HD is modified by the same set of cis-acting sequence HTT repeat variants and trans-acting DNA repair gene variants or additional African-specific genetic variants is yet to be determined.
Although HD has been reported worldwide, there are distinct geographic differences in prevalence, with the lowest rates in African populations and those with African ancestry. These differences are particularly reflected in South Africa, where HD has been reported in three different groups (European ancestry, mixed ancestry, and African ancestry individuals) but at varying prevalence estimates (7.8, 2.2, and 0.5 per 100,000 individuals, respectively). In general, sub-Saharan African populations show greater genetic diversity across the genome than all other populations. Specifically at the HTT locus, a large number of haplotypes have been previously defined, and, among South African non-disease and disease alleles, a unique C haplotype variant (C-SA) has been identified. It is therefore reasonable to expect more sequence diversity within the HTT repeat tract of African individuals. Sequence variation may have a similar modifier effect on the HD phenotype as seen in European ancestry individuals or provide unique insights into disease modification in African ancestry individuals. Using a high-throughput ultra-deep DNA sequencing assay specific for the HTT repeat tract, this study assessed HTT genetic diversity in a sample of South African HD-affected and -unaffected individuals. The results present the sequence variation identified in this complex region, background haplotypes, and the characterization of somatic expansion as potential genetic modifiers of the HD phenotype in individuals of African ancestry.

Materials and methods

Subjects

Blood DNA samples were sourced from the archives of the Division of Human Genetics, University of the Witwatersrand (Wits) and the National Health Laboratory Service (NHLS) in Johannesburg, South Africa, accumulated over the preceding 25 years. The study population comprised 68 unrelated individuals affected with HD and 158 unrelated individuals unaffected with HD, all of African ancestry (68 disease alleles, 384 non-disease alleles). All the individuals included in the present study were South African Bantu speakers from a small geographic area around Johannesburg (± 200 km). Population genetic structure is weak among South African Bantu speakers and is only relevant at a geographical scale, which is far larger than our study area. Our study population therefore constitutes a genetically homogenous Bantu-speaking population. Although only unrelated individuals were included in the sequence diversity assessment, HD-affected relatives of the probands were successfully sequenced to assess intergenerational instability. The affected individuals were originally referred for molecular diagnostic confirmation of disease status and, where available, AoO was patient reported after consultation with a medical geneticist or neurologist. The HD cohort comprised 42 females and 26 males, with an age of diagnosis ranging between 23 and 77 years; additional patient information is shown in the supplemental data (Table S1). Ethical approval for this study was obtained from the Human Research Ethics Committee (Medical), University of the Witwatersrand (certificate numbers M1704130, sub-study M110443).

HTT repeat tract sequence diversity

The HTT repeat tract sequencing followed an established ultra-deep high-throughput sequencing protocol developed to characterize the repeat tract precisely. Following sequencing on an Illumina MiSeq platform, genotyping was carried out using ScaleHD (v0.251) as previously described. The HTT repeat tract reference sequence is made up of a polyglutamine encoding region followed by the polyproline encoding region as shown in Figure 1. The allele structures were defined as either typical or atypical based on a comparison with the reference allele structure (LRG_763). The atypical alleles were defined based on deviations from the reference sequence as a result of variants within the repeat tract at the Q2, P1, and P3 regions.
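The allele structure nomenclature can be illustrated with a minimal sketch. It assumes that each position after Q1 is counted in triplets, so that CAACAG and CCGCCA each count as 2 and their loss as 0, which is inferred from the labels used in this study (e.g., Q1-2-2-7-2 and Q1-2-0-9-2). The function name `allele_structure` and the regular expression are hypothetical and are not part of ScaleHD.

```python
import re
from typing import Tuple

# Assumed tract layout, following the description in the text:
# (CAG)n (CAACAG)m (CCGCCA)k (CCG)p (CCT)t
TRACT = re.compile(
    r"^((?:CAG)+)((?:CAACAG)*)((?:CCGCCA)*)((?:CCG)+)((?:CCT)+)$"
)


def allele_structure(seq: str) -> Tuple[int, str]:
    """Return (pure CAG count, 'Q1-Q2-P1-P2-P3' label) for a repeat tract.

    Q2, P1, P2 and P3 are reported as numbers of triplets (an assumption
    inferred from the labels in the text); Q1 stays symbolic in the label
    because the CAG length varies within each allele structure.
    """
    match = TRACT.match(seq.upper())
    if match is None:
        raise ValueError("sequence does not fit the assumed tract layout")
    q1, q2, p1, p2, p3 = (len(group) // 3 for group in match.groups())
    return q1, f"Q1-{q2}-{p1}-{p2}-{p3}"


if __name__ == "__main__":
    typical = "CAG" * 17 + "CAACAG" + "CCGCCA" + "CCG" * 7 + "CCT" * 2
    ccgcca_loss = "CAG" * 43 + "CAACAG" + "CCG" * 9 + "CCT" * 2
    print(allele_structure(typical))
    print(allele_structure(ccgcca_loss))
```

Real reads would also require alignment and error handling, which this sketch does not attempt.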

HTT background haplotypes

The allele structures were investigated in the context of background haplotypes for the HTT locus spanning ∼196,063 kb on chromosome 4. Thirteen tag single nucleotide polymorphisms (tag-SNPs) were selected from previously studied haplotype SNPs to define haplogroups A, B, and C, and the South African-specific haplogroup variant C-South Africa (C-SA) (Table S2). The tag-SNPs were genotyped using a MassARRAY System from Agena Bioscience, and haplotypes were constructed using manual and statistical phasing. Manual phasing was achieved using homozygous genotypes and repeat tract associations, while the statistical phasing was performed using PHASE (v2.1.1), which employs a Bayesian inference model. Two samples (one disease allele and three non-disease alleles) from the sequence diversity analysis were excluded due to unsuccessful tag-SNP genotyping. The LDhap tool from the LDlink suite was used to derive haplotype frequencies in the 1000 Genomes Project populations for the tag-SNPs used to define the most common disease haplotype.

Quantification of HTT somatic expansion

The ratio of CAG repeat somatic expansions of disease-associated alleles was determined from the MiSeq read count distributions as described previously. The somatic expansion score was then calculated as the residuals of the log-transformed ratio of somatic expansion after adjusting for the effect of the inherited expanded CAG repeat length, age at sampling, and their interaction using multiple linear regression.
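The residual-based score described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' R code: the helper names `solve` and `expansion_scores` are hypothetical, the natural logarithm is assumed for the log transformation, and ordinary least squares is solved through the normal equations. The predictors are centered before the interaction term is formed; this spans the same model space, so the residuals are unchanged, and it only improves numerical stability.

```python
import math


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def expansion_scores(ratios, cag, age):
    """Residuals of log(ratio) ~ CAG + age + CAG:age, one per individual."""
    y = [math.log(r) for r in ratios]
    c_mean = sum(cag) / len(cag)
    a_mean = sum(age) / len(age)
    # Design matrix: intercept, centered CAG, centered age, their product.
    rows = [
        [1.0, c - c_mean, a - a_mean, (c - c_mean) * (a - a_mean)]
        for c, a in zip(cag, age)
    ]
    k = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    return [yi - fi for yi, fi in zip(y, fitted)]
```

A positive score then indicates more somatic expansion than expected for a given inherited CAG length and age at sampling, and a negative score indicates less.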

Statistical analysis

Potential genetic modifiers of HD were investigated with multiple linear models in R (v3.4.3) using RStudio (v1.0.153). The lm function was used to determine associations between the HD phenotype and various explanatory variables: HTT repeat tract, background haplotypes, and somatic expansion score. When studying the HD phenotype, the AoO of motor symptoms is the most well-defined and frequently used measure of disease severity. However, fewer than 50% of the HD-affected individuals in our sample had AoO information. Because the age of diagnosis (AoD) was available for all subjects and strongly correlated with AoO (r2 = 0.68, Figure S1), it was used as a proxy for AoO. The assessment of the modifiers of the HD phenotype was conducted on the subset of HD-affected individuals with CAG repeat length between 39 and 52 repeats, as ≥53 CAG repeats violate linear model assumptions. As a result, four individuals with the following allele structures and haplotypes were excluded from the analysis: two Q1-2-2-10-2 on haplotype C5, one Q1-2-2-7-2 on haplotype A4a, and one Q1-2-0-9-2 on haplotype B2. In the linear models, the references for the allele structures and haplotypes were Q1-2-2-P2-2 and haplotype B2, respectively. To determine whether the variation in the AoD was better explained by HTT allele structure or background haplotype, the R-squared values of the Q1-2-0-9-2 allele structure model and the haplotype B2 model were compared with a goodness of fit test in 5,000 bootstrapped samples. The estimated marginal mean AoD and expansion score were obtained using the emmeans function in R.

Results

A total of 226 samples from individuals of African ancestry, 68 affected with HD and 158 unaffected (68 disease alleles and 384 non-disease alleles) were sequenced and genotyped. Seventeen different allele structures were identified and defined as either typical or atypical alleles (Table 1). The eight allele structures defined as typical had a variable number of CAG repeats and CCG repeats that ranged from six to 13 repeats. Nine allele structures were defined as atypical due to variants resulting in an apparent loss or duplication of the intervening sequences CAACAG (Q2 = 0 or 4) and CCGCCA (P1 = 0) and/or accompanied by an additional downstream CCT (P3 = 3) repeat. All the variants that resulted in the atypical alleles were synonymous and thus translated into huntingtin proteins with pure polyglutamine and pure polyproline regions. Of the 17 allele structures, three (one typical and two atypical) are unique to this study as they have not been previously described (asterisk in Table 1). Schematics for the disease and non-disease allele structures are shown in Figure 1.

Table 1

Summary of African ancestry HTT disease and non-disease alleles

Allele type	Allele structure	Q1 (CAG), non-disease	Q1 (CAG), disease	Q2 (CAACAG)	P1 (CCGCCA)	P2 (CCG)	P3 (CCT)	Non-disease n (N = 384)	Non-disease %	Disease n (N = 68)	Disease %	Fisher exact test p value
Typical alleles	Q1-2-2-6-2	14–17	–	2	2	6	2	11	2.9	–	–	0.384
	Q1-2-2-7-2	15–28	41–55	2	2	7	2	99	25.8	10	14.7	0.064
	Q1-2-2-8-2	17	–	2	2	8	2	5	1.3	–	–	1
	Q1-2-2-9-2	15–28	40	2	2	9	2	29	7.6	1	1.5	0.066
	Q1-2-2-10-2	11–20	40–54	2	2	10	2	71	18.5	20	29.4	0.048
	Q1-2-2-11-2	12–21	–	2	2	11	2	18	4.7	–	–	0.089
	Q1-2-2-12-2	17	–	2	2	12	2	1	0.3	–	–	1
	∗Q1-2-2-13-2	17	–	2	2	13	2	1	0.3	–	–	1
Typical alleles subtotal								235	61.2	31	45.6
Atypical alleles	Q1-2-2-4-3	23	–	2	2	4	3	1	0.3	–	–	1
	Q1-2-2-6-3	15–23	42–44	2	2	6	3	2	0.5	4	5.9	5.587 × 10⁻³
	Q1-2-2-9-3	12–21	–	2	2	9	3	92	24.0	–	–	9.142 × 10⁻⁸
	Q1-2-2-10-3	16	–	2	2	10	3	1	0.3	–	–	1
	∗Q1-4-2-4-3	–	42	4	2	4	3	–	–	1	1.5	0.154
	Q1-4-2-7-3	14–19	–	4	2	7	3	22	5.7	–	–	0.059
	∗Q1-4-2-10-2	16–19	–	4	2	10	2	4	1.0	–	–	1
	Q1-2-0-9-2	16–32	40–58	2	0	9	2	27	7.0	30	44.1	3.119 × 10⁻¹³
	Q1-0-0-9-2	–	39–46	0	0	9	2	–	–	2	2.9	0.022
Atypical alleles subtotal								149	38.8	37	54.4

The novel allele structures unique to this study are indicated by an asterisk (∗). The most common non-disease and disease allele structures are indicated in underlined italics. The statistically significant frequency differences between the non-disease and disease alleles are indicated in italics (non-disease alleles: Q1-2-2-10-2 p = 0.048 and disease alleles: Q1-2-2-6-3 p = 5.587 × 10−3, Q1-2-2-9-3 p = 9.142 × 10−8 and Q1-2-0-9-2 p = 3.119 × 10−13).

The most common disease allele Q1-2-0-9-2 (30 out of 68 = 44.1%) had an atypical structure defined by a CCGCCA loss (P1 = 0). Although also present in unaffected individuals, it represented a much smaller proportion of the non-disease alleles (27 out of 384 = 7.0%). The most common non-disease allele Q1-2-2-7-2 (99 out of 384 = 25.8%) had a typical structure, with variability occurring only in the length of the CAG repeat. When comparing disease and non-disease alleles, one typical allele structure, Q1-2-2-10-2, occurred more frequently in the disease alleles (20 out of 68 = 29.4% versus 71 out of 384 = 18.5%). Among the atypical alleles, the frequency of four structures differed significantly between the disease and non-disease alleles. Three of these, Q1-2-2-6-3, Q1-2-0-9-2, and Q1-0-0-9-2, were more frequent in disease alleles, while one atypical allele structure, Q1-2-2-9-3, was more frequent in non-disease alleles (Fisher exact p values in Table 1). The comparison of these African alleles with the European alleles previously described (746 disease alleles) revealed differences in the structures defined and their frequencies. Among European HTT alleles, 92.2% of the non-disease and 97.2% of disease alleles had a typical allele structure, compared with the African HTT alleles, where only 61.2% of non-disease alleles and 45.6% of disease alleles were typical (non-disease alleles, 235 out of 384 versus 688 out of 746, Fisher exact test p < 2 × 10−16; disease alleles, 31 out of 68 versus 725 out of 746, Fisher exact test p < 2 × 10−16). The most common allele structure, Q1-2-2-7-2 (typical allele structure), was the same for both non-disease and disease alleles in European individuals and in the African non-disease alleles. However, the most common African disease allele structure, Q1-2-0-9-2 (i.e., CCGCCA loss, P1 = 0), was atypical and reportedly rare among European alleles (non-disease alleles, 30 out of 746 = 4.0%; disease alleles, 0 out of 746 = 0%). A particularly interesting case of intergenerational instability was identified in association with the most common African disease allele structure, Q1-2-0-9-2, when relatives of the proband were assessed. In this case, we observed a paternal transmission of 43 CAG repeats, which resulted in an increase to 73 CAG repeats in the child affected with HD.
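The allele frequency comparisons in Table 1 rest on Fisher exact tests of 2 × 2 counts. A minimal sketch of such a test is given below, using only the Python standard library. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided convention used here (summing the probabilities of all tables that are no more likely than the observed one) is an assumption about how the reported p values were obtained.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]].

    Rows could be disease / non-disease alleles and columns
    'has the allele structure' / 'does not have it'.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell,
        # with all margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    p_value = sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Q1-2-0-9-2: 30 of 68 disease alleles versus 27 of 384 non-disease alleles.
    print(fisher_exact_two_sided(30, 68 - 30, 27, 384 - 27))
```

The example call uses the Q1-2-0-9-2 counts from Table 1; other rows can be checked by substituting their counts.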

HTT haplogroup/haplotype diversity

Background haplotypes were constructed for 224 individuals with African ancestry (67 disease alleles and 381 non-disease alleles). Sixteen different haplotypes were identified across the four previously defined haplogroups (A, B, C, and C-SA) as well as an “other” haplogroup (Table 2). The “other” category was applied when the composition of tag-SNP alleles did not fall into any of the previously defined haplogroups/haplotypes.

Table 2

Summary of the HTT haplogroups/haplotypes and associated allele structures in disease and non-disease alleles

Haplogroup	Haplotype	Allele structure	Non-disease n	Non-disease %	Disease n	Disease %
A	∗A2a	Q1-2-2-7-2	–	–	1	1.5
	∗A2b	Q1-2-2-7-2	13	3.4	1	1.5
	A4a	Q1-2-2-7-2	5	1.3	3	4.5
		Q1-2-2-12-2	1	0.3	–	–
		Q1-2-2-13-2	1	0.3	–	–
	A4b	Q1-2-2-7-2	2	0.5	5	7.5
	A6	Q1-2-2-7-2	34	8.9	–	–
B	B1	Q1-2-2-9-2	1	0.3	1	1.5
	B2	Q1-2-0-9-2	25	6.6	29	43.3
		Q1-0-0-9-2	–	–	2	3.0
		Q1-4-2-4-3	–	–	1	1.5
C	C2	Q1-4-2-7-3	21	5.5	–	–
		Q1-2-2-8-2	5	1.3	–	–
	C4	Q1-2-2-9-2	21	5.5	1	1.5
	C4c	Q1-2-2-6-2	11	2.9	–	–
	C5	Q1-2-2-10-2	69	18.1	19	28.4
		Q1-2-2-11-2	18	4.7	–	–
		Q1-2-2-10-3	1	0.3	–	–
		Q1-2-0-9-2	2	0.5	–	–
		Q1-4-2-10-2	4	1.0	–	–
	C8	Q1-2-2-9-2	7	1.8	–	–
C-SA	C3	Q1-2-2-10-2	1	0.3	–	–
		Q1-2-2-9-3	90	23.6	–	–
	C9	Q1-2-2-6-3	2	0.5	4	6.0
		Q1-2-2-7-2	7	1.8	–	–
	C10	Q1-2-2-4-3	1	0.3	–	–
Other	O	Q1-2-2-7-2	36	9.4	–	–
		Q1-2-2-9-3	2	0.5	–	–
		Q1-2-2-10-2	1	0.3	–	–
Total			#381	100.0	67	100.0

The two haplotypes that had not been previously identified in African ancestry individuals are indicated by an asterisk (∗). The most common non-disease and disease haplogroup/haplotype are indicated in underlined italics. The most common disease allele structure Q1-2-0-9-2 (29 out of 67 = 43.3%) is indicated in italics. Two samples (one disease allele and three non-disease alleles) from the sequence diversity analysis presented in Table 1 were excluded due to unsuccessful tag-SNP genotyping (#).

The largest proportion of non-disease alleles occurred on haplogroup C (159 out of 381 = 41.7%) and, within haplogroup C, haplotype C5 was the most common (94 out of 381 = 24.7%) (Table 2). For disease alleles, the largest proportion occurred on haplogroup B (33 out of 67 = 49.3%) and, within haplogroup B, haplotype B2 was the most common (32 out of 67 = 47.8%). The most common allele structure (Q1-2-0-9-2) in the disease alleles, characterized by the CCGCCA loss, was found exclusively on haplotype B2, and this haplotype was therefore further assessed to determine whether it was African specific. Haplotype B2 was found to be the most frequent in the seven African and African ancestry populations of the 1000 Genomes Project (Figure 2). The frequency ranged from 6.6% in Americans of African Ancestry in Southwest US (ASW) to 9.9% in the African Caribbean in Barbados (ACB). Among the non-disease alleles included in the present study, a comparable frequency of 6.6% was identified for haplotype B2. Apart from Puerto Rico (PUR), where its frequency was 3.4%, haplotype B2 was rare (frequency ≤ 1%) in all the other non-African populations. This suggests that, although this analysis was only conducted in non-disease alleles, haplotype B2 is of African origin and largely African specific.

Figure 2

Frequency of the HTT haplotype B2 in the populations of the 1000 Genomes Project

The African B2 haplotype was defined by SNPs rs2857936-rs762855-rs4690073 as described by Baine et al. The haplotype frequencies were obtained using the LDhap tool from the LDlink suite (ldlink.nci.nih.gov). Haplotype B2 was shown to have the highest frequencies among the African and African ancestry populations, ranging between 6.6% and 9.9%. Outside of the continental African populations, Puerto Rico (American) had the highest frequency of haplotype B2 (3.4%), followed by the five East Asian populations (range from 0.5% to 1.0%). The Colombian (American), Utah residents (European), and Sri Lankan (South Asian) populations had low frequencies (0.5%), and B2 was not detected in the rest of the populations analyzed. The results were comparable with the frequency of B2 in the African ancestry non-disease alleles included in this study. This indicates that, although this analysis was only conducted in non-disease alleles, haplotype B2 may be of African origin and an African-specific haplotype.


CAG somatic expansion

The modifiers of the ratio of somatic CAG expansion of disease-associated alleles were assessed through the inclusion of the following explanatory variables: inherited expanded CAG repeat length, the age at sampling, and their interaction (Table S3, Model 1). The inherited expanded CAG repeat length and the age at sampling were shown to have a highly significant association with the ratio of somatic CAG expansions of the disease-associated allele observed in blood DNA (p < 2 × 10−16). A larger effect was observed for the inherited expanded CAG repeat length, with every additional CAG repeat resulting in an increase of 0.131 (p = 8 × 10−16) in the ratio of somatic expansions, while every additional year in the age at sampling increased the ratio of somatic expansion by 0.008 (p = 1.8 × 10−3). In line with previous studies, the inherited CAG repeat length was shown to be the primary driver of the ratio of somatic expansion. Moreover, a highly significant association (p < 2 × 10−16) was identified between allele structures and the ratio of somatic expansion (Figure S2; Table S3, Model 2). In addition to the CAG repeat, the disease allele structures Q1-0-0-9-2 (p = 7.7 × 10−4), Q1-2-0-9-2 (p = 1 × 10−5), and Q1-2-2-6-3 (p = 0.014) were each shown to have a significant association with the ratio of somatic expansion. Individuals with these disease allele structures had a mean decrease in somatic expansion of 0.26, 0.17, and 0.15, respectively. Thus, individuals with the typical allele structure (Q1-2-2-P2-2) had a significantly higher ratio of somatic expansion overall, while individuals with disease alleles characterized by the loss of the CCGCCA sequence had the lowest ratio of somatic expansion.

Potential modifiers of the HD phenotype

HTT repeat tract modification

A highly significant negative association (p = 3 × 10−14) was detected between the CAG repeat and the AoD, accounting for approximately 60% of the variation in the AoD (Figure S3). The degree of variation in AoD explained by CAG is directly comparable with the degree of variation in AoO explained by CAG in European ancestry populations (Figure S4), further supporting the clinical utility of AoD. The other components of the HTT repeat tract were assessed individually for their association with the AoD (in years) (Table S4, Model 1). In addition to the CAG repeat length (2.9 years earlier, p = 6 × 10−12), the CCGCCA sequence was shown to have a significant association with the AoD (4.0 years earlier for loss of CCGCCA, p = 7 × 10−4). Thus, each additional CAG repeat and loss of the CCGCCA sequence resulted in an earlier AoD in individuals affected with HD. A similar trend of decreased AoO for individuals with the CCGCCA-loss allele was observed, although, not surprisingly given the very small sample size (n = 24), it was not statistically significant (Figure S5; Table S4, Models 2 and 3). The association of each allele structure (all components of the repeat tract together) with the AoD was also assessed. A highly significant correlation was identified between the inherited CAG repeat length within each of the allele structures and the AoD (p = 1 × 10−10) (Figure 3A). The CCGCCA-loss allele structure (Q1-2-0-9-2) was the only allele structure that had a detectable significant association with the AoD in comparison with the grouped typical allele structures (7.1 years earlier, p = 8 × 10−4) (Table S4, Model 4).

Figure 3

The HTT allele structure associated with age at HD diagnosis and somatic expansion of the HD allele in blood DNA in African ancestry individuals

(A) Linear regression analysis testing the association between the log transformed AoD and the inherited CAG repeat length for each disease allele structure revealed a significant association (r2 = 0.61, p = 1.36 × 10−10). The Q1-0-0-9-2 and Q1-2-0-9-2 disease allele structures characterized by the loss of one or more of the intervening sequences had the earliest AoD.

(B) The estimated marginal mean AoD for the disease allele structures, corrected for repeat size. The Q1-2-0-9-2 allele structure had the earliest mean AoD (n = 30, 45.5 years: 95% CI = 43.0–48.2), followed by Q1-0-0-9-2 (n = 2, 47.1 years: 95% CI = 38.1–58.1), Q1-4-2-4-3 (n = 1, 50.4 years: 95% CI = 37.4–67.9), Q1-2-2-P2-2 (n = 31, 53.0 years: 95% CI = 49.9–56.3), and Q1-2-2-6-3 (n = 4, 56.9 years: 95% CI = 48.9–66.0).

(C) Linear regression analysis testing the association between the log transformed AoD (corrected for CAG repeat length and allele structure) and expansion score. Overall, a significant association (p = 0.012) was identified.

(D) The estimated marginal mean expansion score for the allele structures, corrected for CAG repeat length and age at sampling. The Q1-0-0-9-2 (n = 2, 0.32: 95% CI = 0.227–0.458) and Q1-2-0-9-2 (n = 30, 0.42: 95% CI = 0.380–0.460) allele structures had the lowest mean expansion score followed by Q1-2-2-6-3 (n = 4, 0.44: 95% CI = 0.348–0.545), Q1-4-2-4-3 (n = 1, 0.44: 95% CI = 0.288–0.680), and Q1-2-2-P2-2 (n = 31, 0.60: 95% CI = 0.535–0.669).

The estimated marginal mean AoD for each disease allele confirmed that individuals with the commonest African allele structure, Q1-2-0-9-2, have the earliest mean AoD of 45.5 years, while individuals with the Q1-2-2-6-3 allele structure had the most delayed mean AoD of 56.9 years (Figure 3B).

HTT haplogroup/haplotype modification

Haplogroups A and C and the haplogroup variant C-SA were shown to have a significant positive association with the HD phenotype (delayed AoD) when compared with the most common haplogroup B (Table S4, Model 5). Individuals with an expanded HTT allele occurring on haplogroup B had a significantly earlier AoD compared with the other haplogroups: haplogroup C, 6.2 years (p = 0.022); haplogroup A, 8.6 years (p = 0.014); and haplogroup variant C-SA, 11.8 years (p = 0.012). Individuals with an expanded HTT allele on haplotype B2 had a significantly earlier AoD compared with the other haplotypes: A4a, 16.8 years (p = 0.018); C5, 7.8 years (p = 6.8 × 10−3); A4b, 9.1 years (p = 0.029); C9, 12.3 years (p = 7.9 × 10−3); and B1, 22.3 years (p = 0.019) (Table S4, Model 6). The estimated marginal mean AoD for each disease haplotype confirmed that haplotype B2 had the earliest AoD of 45.5 years (n = 29, 95% confidence interval [CI] = 43.0–48.1), while individuals with haplotype B1 had the most delayed mean AoD of 65.5 years (n = 1, 95% CI = 48.6–88.2) (Figure S6). The earliest mean AoD in individuals with haplotype B2 was the same as that for individuals with the most common allele structure, Q1-2-0-9-2, as these alleles occurred exclusively on the haplotype background B2. To assess whether the allele structure itself or another variant on haplotype B2 was a more likely explanation for the disease-hastening effect detected, a goodness of fit test on 5,000 bootstrapped samples was conducted. Comparing the CCGCCA-loss allele structure (Q1-2-0-9-2) model with the haplotype B2 model showed that neither explained the earlier AoD significantly better than the other (Figure S7). There was therefore no statistical indication that haplotype B2 was more strongly associated with the AoD than the CCGCCA-loss allele structure (Q1-2-0-9-2).

CAG somatic expansion modification

The effect of the ratio of somatic expansion on the AoD was then considered through the assessment of the expansion score. The results revealed a highly significant correlation (p = 1.296 × 10⁻⁹) and an R-square value of 0.63. The inherited CAG repeat length, disease allele structures Q1-0-0-9-2 and Q1-2-0-9-2, and the expansion score were all shown to have a significant association with the AoD (Table 3, Model 1). Every CAG repeat increase resulted in an earlier AoD by 3.5 years (p = 2 × 10⁻¹¹), while the allele structures Q1-2-0-9-2 and Q1-0-0-9-2 resulted in an earlier AoD by 10.2 years (p = 4 × 10⁻⁵) and 11.5 years (p = 0.034), respectively, compared with the grouped typical allele structure Q1-2-2-P2-2. Lastly, every unit increase in the expansion score resulted in an earlier AoD by 10.6 years (p = 0.012).

Table 3

Multiple linear models testing the association between the HD phenotype and various explanatory variables

Model	Formula	r²	p value for model	Sample size	Explanatory variable	Effect in years	p value for explanatory variable
1	Ln(AoD) ∼ CAG + allele structures + expansion score	0.625	1.296 × 10⁻⁹	60	CAG	−3.504	1.56 × 10⁻¹¹
				2	Q1-0-0-9-2	−11.491	0.034
				30	Q1-2-0-9-2	−10.180	4.20 × 10⁻⁵
				4	Q1-2-2-6-3	−0.840	0.846
				1	Q1-4-2-4-3	−5.903	0.411
					Expansion score	−10.600	0.012
2	Ln(AoD) ∼ CAG + haplotypes + expansion score	0.664	2.989 × 10⁻⁸	60	CAG	−3.665	8.93 × 10⁻¹¹
				1	A2a	2.050	0.784
				1	A2b	11.664	0.163
				1	A4a	12.985	0.137
				4	A4b	15.765	3.28 × 10⁻³
				1	B1	29.224	3.37 × 10⁻³
				1	C4	2.773	0.719
				16	C5	14.250	1.41 × 10⁻⁴
				4	C9	11.517	9.37 × 10⁻³
					Expansion score	−12.090	7.81 × 10⁻³

The statistically significant explanatory variables are indicated in italics. Model 1. Linear model testing the association of the CAG repeat length, allele structure, and expansion score with the AoD, relative to the grouped typical allele structure Q1-2-2-P2-2. The R-square and p values of the overall model show a significant association (r² = 0.63, p = 1 × 10⁻⁹); the CAG repeat length, allele structures Q1-0-0-9-2 and Q1-2-0-9-2, and expansion score each had a significant association. Model 2. Linear model testing the association of the CAG repeat length, background haplotype, and expansion score with the AoD, relative to the most common haplotype B2. The R-square and p values of the overall model show a significant association (r² = 0.66, p = 3 × 10⁻⁸); the CAG repeat length; haplotypes A4b, B1, C5, and C9; and expansion score each had a significant association.

Similarly, when the background haplotype was considered in the assessment, a highly significant correlation (p = 3 × 10⁻⁸) and an R-square value of 0.66 were identified (Table 3, Model 2). Every CAG repeat increase resulted in an earlier AoD by 3.7 years, while the background haplotypes A4b, B1, C5, and C9 resulted in a delayed AoD by 15.8 years (p = 3 × 10⁻³), 29.2 years (p = 3 × 10⁻³), 14.3 years (p = 1 × 10⁻⁴), and 11.5 years (p = 9 × 10⁻³), respectively, compared with the background haplotype B2. Lastly, every unit increase in the expansion score resulted in an earlier AoD by 12.1 years (p = 8 × 10⁻³). The association of the expansion score with the AoD, corrected for CAG repeat size and allele structure, revealed an overall significant negative correlation (p = 0.012), illustrating the expansion score result observed in Table 3, Model 1 (Figure 3C).
The estimated marginal mean expansion scores for the disease allele structures confirmed that the largest mean expansion score was identified in the grouped typical allele structure Q1-2-2-P2-2 at 0.60, while the lowest expansion scores were associated with the atypical allele structures Q1-2-0-9-2 at 0.42 and Q1-0-0-9-2 at 0.32 (Figure 3D). Thus, although somatic expansion was significantly associated with the AoD overall, individuals with the commonest African allele structure, Q1-2-0-9-2, who had the earliest AoD, also had one of the lowest expansion scores in blood DNA. The earlier AoD seen in these individuals could thus not be attributed to somatic expansion in blood DNA.

Discussion

This study set out to characterize the HTT repeat tract sequence in African ancestry HD disease and non-disease alleles, and ultimately assess potential cis-acting genetic modifiers of the HD phenotype. A large amount of sequence diversity was observed with 17 different allele structures identified: eight were defined as typical (variation only in the number of CAG/CCG repeats), while nine were atypical (variation present throughout the HTT repeat tract). Less variation was identified in the non-disease alleles, with typical allele structures being more frequent, while atypical allele structures were more frequently observed in disease alleles. Across the non-disease alleles, the typical allele structure Q1-2-2-7-2 was the most common. This allele structure has been previously shown to be the most common in both European ancestry non-disease (∼92%) and disease alleles (∼97%). In contrast, the atypical allele structure Q1-2-0-9-2, characterized by the CCGCCA loss, was the most common (∼44%) in African disease alleles. Although this allele structure has been previously identified in European ancestry individuals, it is very rare, especially among individuals affected with HD (0 out of 746). Three of the 17 allele structures identified in the disease and non-disease alleles were unique to this study (Table 1). This is possibly due to these allele structures being very rare in previously studied populations or, more likely, specific to African ancestry individuals. The differences between atypical allele frequencies in an African population and those recently reported for European alleles (European atypical non-disease ∼8%, disease ∼3%; versus African atypical non-disease ∼39%, disease ∼54%) highlight the importance of research across different populations to improve understanding of the full range of diversity.
Analysis of the broader HTT locus in individuals of African ancestry revealed that the largest proportion of non-disease alleles occurred on haplogroup C and haplotype C5, while the largest proportion of disease alleles occurred on haplogroup B and haplotype B2 (Table 2). A comparison with European ancestry haplotypes revealed that the largest proportion of non-disease alleles occurred on haplogroup C, while the largest proportion of disease alleles occurred on haplogroup A. The most common disease allele structure, characterized by the CCGCCA loss, occurred exclusively on haplotype B2. Although haplotype B2 has been identified in individuals of European ancestry, it is rare and differs by at least one tag-SNP (J.A. Collins, personal communication; M.R. Hayden, personal communication; G.E.B. Wright, personal communication). The assessment of haplotype B2 in other populations worldwide showed that it is frequent (≥6.6%) in African populations and rare (≤1%) in non-African populations. The presence of haplotype B2 at a frequency of 3.4% among Puerto Ricans (Figure 2) is in line with the fact that ∼10% of the genome of these individuals is of African ancestry. The higher frequency in the African populations provides support for haplotype B2 being African specific and of African origin. We have also identified the presence of haplotype variants A2a and A2b in two of our African individuals affected with HD, suggesting that, although rare, European high-risk haplotypes are present in African ancestry individuals. Prior to this study, A2a and A2b were described as absent from East Asian and African ancestry populations. The presence of these haplotypes is potentially a result of admixture with European populations. Alternatively, these haplotypes may have been present in ancestral African populations and increased in frequency in European populations due to population bottlenecks arising during migration out of Africa.
Recent data have confirmed somatic expansion of the HTT CAG repeat as a potential driver of HD severity. In European ancestry individuals affected with HD, individual-specific rates of somatic expansion in blood DNA are inversely correlated with AoO, and positively correlated with disease progression. Here, we have demonstrated that, overall, there is a significant inverse association between individual-specific levels of somatic expansion in blood DNA and AoD as a proxy for age at onset in an African ancestry HD cohort. Whether individual-specific rates of somatic expansion in African individuals affected with HD are driven by the same set of DNA repair gene variants as observed in European populations is yet to be determined. However, given the higher genetic diversity observed in African populations, it seems likely that additional African-specific genetic variants may be in operation. It has also recently been determined that HD severity is best explained by the length of the pure CAG repeat tract (Q1) and not by the length of the polyglutamine tract encoded (Q1 + Q2). Since the degree of somatic expansion is also best predicted by pure CAG length (Q1), these data suggest that somatic CAG expansion is potentially more important in relation to disease severity and progression than the number of glutamines encoded in the inherited allele. As all of the CAACAG duplications observed previously in the European ancestry population were present on a typical CCGCCA polyproline encoding background, the data presented here do not alter the interpretation of the primary effect of the CAACAG duplication. However, since the very rare CAACAG loss is observed on alleles both with and without the CCGCCA sequence in European ancestry populations, it is possible that some of the effects attributed to the CAACAG loss might be due to and/or exacerbated by the CCGCCA loss.
Indeed, even after correcting for the number of pure CAG repeats, loss of the CAACAG sequence was still associated with worse HD outcomes. Unfortunately, the number of individuals with the double loss of the CAACAG and CCGCCA sequences (3 out of 746), versus those with only the CAACAG loss (4 out of 746) and those with only the CCGCCA loss (0 out of 746), precludes a reanalysis of our previously published data. Only two disease alleles lacking the CAACAG sequence (Q2 = 0) were detected in this study, precluding an assessment of the impact of this structure on HD severity. Rather, we determined that individuals carrying disease allele structures characterized by loss of the CCGCCA sequence (P1 = 0) had an earlier AoD by 4.0 years compared with individuals with the CCGCCA sequence (P1 = 2). Significant associations were also identified when comparing the disease allele structure Q1-2-0-9-2, characterized by loss of the CCGCCA sequence, with the reference allele structure, Q1-2-2-P2-2, with individuals having an earlier AoD by 7.1 years. One limitation of our study is that we were not able to obtain detailed clinical information on our HD cohort, and the widely used measure of AoO was only available for a small subset. Clearly, future studies would be facilitated by more in-depth phenotyping. Nonetheless, the robust and highly significant genetic associations we have revealed here confirm that AoD is a clinically meaningful measure capable of providing meaningful insights into HD biology. The CCGCCA loss is thus proposed as a cis-acting modifier of the HD phenotype in African ancestry individuals. Very recently, an exome sequencing strategy applied to a cohort of HD individuals of European ancestry with either extreme early or extreme late AoO relative to their measured CAG length confirmed effects for the duplication and loss of the CAACAG sequence. 
Interestingly, these analyses also revealed 2 out of 213 individuals with extreme early onset with the Q1-2-0-9-2 structure. This structure was not observed in 206 individuals in the extreme late cohort, nor in 746 individuals in our unselected European ancestry HD cohort. These data suggest that the Q1-2-0-9-2 structure is over-represented in an extreme early cohort relative to an unselected cohort (2 out of 213 versus 0 out of 746, p = 0.049, Fisher’s exact test), and in an extreme early cohort relative to a combined unselected/extreme late cohort (2 out of 213 versus 0 out of 952, p = 0.033, Fisher’s exact test). These data thus suggest that the CCGCCA loss may also be a cis-acting modifier of HD motor onset in European individuals affected with HD. Since inter-locus CAG repeat length instability is modified by the flanking sequence, it seems plausible that polymorphisms within the sequence could mediate changes in somatic instability. Previous inter-locus analyses of the relative expandability of multiple disease-associated CAG•CTG repeats (HD, DM1, SCA1, 2, 3, 7, etc.) have revealed associations between higher repeat instability and higher guanine and cytosine (GC) content in the immediate DNA flanking the CAG•CTG repeat. It thus seems that a reasonable extension of this observation might be that genetic variants that alter the GC content of the flanking DNA between alleles at one locus might similarly drive differences in somatic instability. Our data support this model, in that the CCGCCA loss was associated with altered somatic expansion scores. However, contrary to the prediction that higher GC content in the flanking sequence, as mediated by the CCGCCA loss, would increase expandability, we found loss of the CCGCCA sequence was actually associated with lower levels of somatic expansion. 
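The two Fisher's exact test results quoted above can be reproduced directly from the counts given in the text. The following is a minimal Python sketch, not the authors' code: it assumes a two-sided test that sums the probabilities of all 2 × 2 tables (with fixed margins) that are no more likely than the observed one, which is the convention used by common statistical packages. The function name is illustrative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums hypergeometric probabilities of every table with the same margins
    whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def pmf(x):
        # Probability that x of the col1 carriers fall in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = pmf(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance guards against floating point ties.
    return sum(
        pmf(x)
        for x in range(lowest, highest + 1)
        if pmf(x) <= p_observed * (1 + 1e-9)
    )


# Q1-2-0-9-2 carriers vs non-carriers, counts taken from the text:
# extreme early cohort (2 of 213) vs unselected cohort (0 of 746)
p_unselected = fisher_exact_two_sided(2, 211, 0, 746)
# extreme early cohort (2 of 213) vs combined unselected/extreme late cohort (0 of 952)
p_combined = fisher_exact_two_sided(2, 211, 0, 952)

if __name__ == "__main__":
    print(round(p_unselected, 3), round(p_combined, 3))
```

Under these assumptions the sketch gives values that round to the reported p = 0.049 and p = 0.033. With only two carriers in total, both results rest on very small counts, which is why the text frames the European finding as suggestive.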
Thus, unless this effect is reversed in the critical brain regions, we speculate that the disease-accelerating association of the CCGCCA loss is mediated by a pathway other than somatic instability. As the CCGCCA loss is a synonymous variant that does not alter the coding potential of the pure polyglutamine or pure polyproline tract, there is no obvious mechanism by which this variant could affect the amino acid sequence of the HTT protein. The total number of prolines encoded by the Q1-2-0-9-2 alleles is 11, exactly the same as that encoded by the most common typical expanded allele structure, Q1-2-2-7-2, in European ancestry populations. Combined with the observation that the number of prolines encoded by the variable CCG repeat (P2) has not been revealed as a modifier of HD onset (Table S4, Model 1, and Panegyres et al.), it is unlikely that the phenotypic consequence of the CCGCCA loss is mediated simply by the number of prolines in the HTT protein. As has previously been speculated for the residual modifying effect of the CAACAG sequence (Q2) after correcting for pure CAG length, the effect of the CCGCCA loss could be driven by mechanisms that affect the efficiency of HTT transcription, mRNA folding or splicing, or canonical and/or repeat-associated non-AUG (RAN) translation.37, 38, 39, 40 In particular, the CAACAG-CCGCCA intervening sequence lies at a key position in the HTT mRNA that demarcates the boundary of a long CAG hairpin that is observed in expanded disease-associated alleles, but not in non-disease-associated alleles.37, 38, 39, 40 The CCGCCA effects on mRNA folding in this region could affect RAN translation, which has recently been shown to be highly sensitive to repeat sequence variation at the ATXN8 locus. Alternatively, there may be effects on protein translation.
Polyproline regions are known to stall translation, an effect that might be further modulated by the relative frequency of CCA and CCG proline tRNAs with potential downstream consequences on HTT protein folding. Alternatively, it is possible there is an effect mediated by a linked variant. In other repeat expansion disorders such as SCA1, SCA2 and DM1, interruptions in the repeat tract have been shown to be associated with the disease phenotype. In SCA1, interruptions in the repeat tract confer increased stability, delay AoO, and slow down the rate of aggregation. In SCA2, CAA interruptions were shown to be associated with a parkinsonism disease phenotype, while in individuals affected with DM1 carrying repeat interruptions there was a later AoO than expected for the repeat length and a reduced level of somatic expansion. The CCG and CGG interruptions have been shown to have a stabilizing effect in the blood and often lead to milder symptoms. Although the CCGCCA loss was not associated with an increased level of somatic expansion in blood DNA, we did identify a relatively rare large germline expansion where a paternal transmission of 43 CAG repeats to 73 CAG repeats resulted in juvenile HD (JHD) in an HD family carrying the CCGCCA-loss disease allele. Approximately 80% of JHD cases are the result of a paternal transmission, which can be attributed to substantial increases in repeat length occurring during male gametogenesis. A previous case report showed the CCGCCA loss on haplogroup B was associated with a very unusual paternal transmission of 26 CAG repeats to 44 CAG repeats in the child. These data suggest that the CCGCCA-loss allele may be associated with higher rates of germline expansion, as has also been proposed for CAACAG loss alleles.
The analysis of background haplotypes revealed that disease allele structures (Q1-0-0-9-2 and Q1-2-0-9-2) characterized by the CCGCCA loss were both present on haplotype B2, as well as being negatively associated with the HD phenotype, compared with haplotypes A4a, A4b, B1, C5, and C9. Haplotype B2 can thus be designated a high-risk haplotype (for early diagnosis) in African ancestry individuals due to its virtually complete association with the CCGCCA loss in disease alleles. The CCGCCA loss and haplotype B2 effects could not be separated out as there is no statistical indication that the earlier AoD exhibited in these individuals is better explained by the CCGCCA loss allele structure or haplotype B2. It is thus possible that the CCGCCA loss is in linkage disequilibrium with another variant on haplotype B2 that affects disease biology. For instance, a linked promoter or enhancer variant might affect HTT transcription rates. Although HD has been extensively studied in European ancestry individuals, the allele sequence diversity within the HTT repeat tract in African ancestry individuals has not been previously described. Substantial diversity, shown by the presence of predominantly atypical allele structures, is reported. Intriguingly, the most common HD disease allele structure in an African ancestry HD population in South Africa is characterized by the loss of the CCGCCA sequence. This CCGCCA-loss allele structure is associated with an earlier AoD (by 7.1 years) among South African affected individuals of African ancestry, and possibly earlier age at motor onset among European individuals affected with HD. Among the HD alleles of African ancestry we have analyzed, this CCGCCA-loss allele structure occurs exclusively on haplotype B2, which we propose as a high-risk haplotype in African ancestry individuals. 
Despite our observation that overall somatic expansion had a significant inverse association with the HD phenotype in African individuals, in general, the CCGCCA-loss allele structure had the lowest ratio of somatic expansion in blood DNA, suggesting that the disease-accelerating association of the CCGCCA-loss allele is not mediated by an increase in somatic expansion. We propose the CCGCCA-loss allele occurring on haplotype B2 is a cis-acting modifier of HD in our African ancestry individuals that accelerates disease diagnosis through a mechanism that is not driven by somatic instability. Further larger studies in well phenotyped African and European ancestry populations will be required to determine whether the associations observed here are driven directly by the CCGCCA loss and/or by broader haplotype effects. Importantly, this study represents a single African population and thus further ascertainment of African individuals affected with HD and studies of non-disease alleles in Africa are warranted. Nonetheless, these findings already contribute uniquely to the body of knowledge of HD and provide population-specific sequence data for individuals previously understudied.

Data and code availability

The HTT repeat tract was genotyped from the MiSeq reads generated using ScaleHD (v0.251) (https://github.com/helloabunai/ScaleHD). The HTT repeat tract sequence alignments were visualized in Tablet (v1.17.08.17) (https://ics.hutton.ac.uk/tablet/). Statistical analyses were undertaken in R (v3.4.3) (https://www.r-project.org) using RStudio (v1.0.153) (https://www.rstudio.com). The dataset and code supporting the current study have not been deposited in a public repository because broad ethical consent was not granted (the study participants were selected retrospectively from banked samples), but they are available from the corresponding author on request.
