Literature DB >> 30845214

A novel nonsense variant in SUPT20H gene associated with Rheumatoid Arthritis identified by Whole Exome Sequencing of multiplex families.

Maëva Veyssiere¹, Javier Perea¹, Laetitia Michou², Anne Boland³, Christophe Caloustian³, Robert Olaso³, Jean-François Deleuze³, François Cornelis⁴, Elisabeth Petit-Teixeira¹, Valérie Chaudru¹.

Abstract

The triggering and development of Rheumatoid Arthritis (RA) is conditioned by environmental and genetic factors. Despite the identification of more than one hundred genetic variants associated with the disease, not all the cases can be explained. Here, we performed Whole Exome Sequencing in 9 multiplex families (N = 30) to identify rare variants susceptible to play a role in the disease pathogenesis. We pre-selected 77 genes which carried rare variants with a complete segregation with RA in the studied families. Follow-up linkage and association analyses with pVAAST highlighted significant RA association of 43 genes (p-value < 0.05 after 106 permutations) and pinpointed their most likely causal variant. We re-sequenced the 10 most significant likely causal variants (p-value ≤ 3.78*10-3 after 106 permutations) in the extended pedigrees and 9 additional multiplex families (N = 110). Only one SNV in SUPT20H: c.73A>T (p.Lys25*), presented a complete segregation with RA in an extended pedigree with early-onset cases. In summary, we identified in this study a new variant associated with RA in SUPT20H gene. This gene belongs to several biological pathways like macro-autophagy and monocyte/macrophage differentiation, which contribute to RA pathogenesis. In addition, these results showed that analyzing rare variants using a family-based approach is a strategy that allows to identify RA risk loci, even with a small dataset.

Entities: Chemical Disease Gene Mutation Species

Mesh：

Substances：

Year: 2019 PMID： 30845214 PMCID： PMC6405192 DOI： 10.1371/journal.pone.0213387

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

Introduction

Rheumatoid Arthritis (RA) is one of the most frequent autoimmune disease, affecting 0.3 to 1% of the worldwide population. Since the discovery of HLA locus as a risk factor for autoimmune diseases [1] and specifically HLA-DRB1 for RA, more than 100 RA genetic factors were identified by Genome Wide Association Studies (GWASs) [2,3]. However, the effect of these genetic risk factors is too weak to explain the entire RA genetic component. Indeed, the heritability attributed to HLA-DRB1 shared-epitope (SE) alleles was estimated between 11% [4] and 37% [5]. While GWASs loci identified outside the HLA locus only explain an additional five percent of RA heritability [6]. Several hypotheses have been proposed to explain the unknown part of this complex disease genetic component. One of these hypotheses relies on the fact that rare variants, which are poorly detected by GWASs, contribute to the risk of complex diseases. During the last decade, the development of Next Generation Sequencing has facilitated the detection of such variants. Hence, several studies used exome sequencing to identify rare to low frequency variants and evaluate their contribution to RA risk. For this purpose, two studies used a candidate gene approach based on exome sequencing [7,8]. A first study, based on a population of European ancestry, showed an aggregation of non-synonymous rare variants contributing to RA risk into IL2RA and IL2RB loci [7]. Another one, which studied Korean RA cases and healthy controls, allowed the identification of a weak association of rare missense variants in 17 different genes [8]. In this latter group, the VSTM1 gene gave rise to an over-expression of its mRNA product in RA cases [9]. However, both studies restricted their analysis to a limited number of candidate genes. Thus, they may have missed variations contributing to RA risk present outside the loci of those candidates. Other studies [10,11] succeeded at identifying new RA risk loci by analyzing rare coding variants extracted after Whole Exome Sequencing (WES). Thus, an association between RA and PLB1 gene [11] and several other genes involved in the production of reactive oxygen species (ROS) such as NDUFA7 and SCO1 [10] has been observed using respectively a familial-based strategy and a case-control analysis. Both studies, conducted in non-European populations, support the strategy of sequencing the whole exome of RA cases to identify new RA candidate variants. In our study, we aimed at identifying new loci associated with RA in the French population by focusing our research on rare coding variants. For this purpose, we sequenced RA cases and healthy relatives from 9 multiplex families which carried HLA-DRB1 risk alleles. However, even in the cases who carried the SE alleles, we can observe some clinical and genetic heterogeneity. Hence, we should be able to identify both genetic factors modulating the effect of HLA-DRB1 SE and acting independently from HLA loci.

Results

Overview

In this study, we sequenced the whole exome of 19 RA cases and 11 healthy individuals belonging to 9 multiplex families (discovery set). We applied a three steps strategy (Fig 1) to prioritize the sequenced loci and identify RA risk variants. First, we selected genes carrying rare variants which were family specific and with a high damaging predicted effect. Then, we assessed the potential combined effect of rare and low-frequency variants in these genes on RA risk using pVAAST [12]. Finally, for the 10 genes with the strongest genetic association, we re-sequenced the leading effect variant in the validation set. This final step allowed us to validate their family specific co-segregation with RA.

Fig 1

Description of the study design.

This schema of the study presents the 3 main steps of the analysis and, the resulting number of variants selected through each sub-analysis. (A) The selection of candidate variants was performed on the discovery set and resulted in the identification of 77 candidate SNVs (B) The association analysis was applied on the discovery set and 98 controls extracted from the 1000 genomes project. The 43 genes significantly associated with RA were re-tested after removing the candidate variant identified in step A and, the candidate variant was tested alone (C) The SNVs with the strongest RA association were re-sequenced in the extended families of the discovery, plus 9 new multiplex families.

Description of the study design.

Description of the discovery set

The male: female ratio was 6:13 in the discovery set selected for WES (described in Table 1), which is similar to the ratio observed in RA affected populations [13,14]. All the RA-affected individuals carried at least 1 shared epitope allele of HLA-DRB1 (18 of 19 had 2 SE alleles). Fifteen affected participants (79%) were positive for Anti-Citrullinated Peptide Antibodies (ACPA) and Rheumatoid Factor (RF), one (5%) was positive for the RF alone and two (11%) were negative for both. These two seronegative cases belonged to the same family (referred as family 5). The affected members of families 3, 4 and 9 were diagnosed with early RA (the mean age at diagnosis was of 31, 36 and 26 years old, respectively).

Table 1

Description of the discovery set.

Pedigree id	Affected by RA^a											Unaffected by RA^a
	All	Male	Female	ACPA^b			RF^c			Age of diagnosis		All	Male	Female	ACPA^b			RF^c			Sampling age
	All	Male	Female	-	+	NA	-	+	NA	≥40 years old	<40 years old	All	Male	Female	-	+	NA	-	+	NA	≥40 years old	<40 years old
Only RA reported
2	2		2		2			2		1		1		1	1			1			1
3	3	2	1		3			3		1	1	1		1	1			1
4	2	2			2			2		1	1	1		1	1			1			1
5	2		2	2			2			2		1		1	1			1			1
6	2		2		2			2				1	1			1		1			1
7	2		2		2			2				2	1	1	1	1		2			1
RA and other AID reported
9	2	1	1		2			2			2											1
10	2		2	1	1			2		1	1	2	1	1	2			2			1	1
11	2	1	1		1	1		1	1			2		2	2			2			1	1

a: Rheumatoid Arthritis

b: Anti-Citrullinated Peptides Antibodies

c: Rheumatoid Factor

a: Rheumatoid Arthritis b: Anti-Citrullinated Peptides Antibodies c: Rheumatoid Factor To check whether the 98 selected controls would not inflate the type I error during the burden test, we performed a PCA based on these samples and the discovery set. The genetic variability (<2.12% of the total variability) observed in the PCA (Fig 2) was due to the variability between the selected families and not between families and controls.

Fig 2

Principal component analysis (PCA) results of samples processed for pVAAST analysis.

The PCA is based on the 30 samples from the discovery set and 98 CEU controls extracted from the 1000 genomes databases. This plot represents the 2 first principal components which respectively account for 2.12% and 1.87% of the genetic variability. The color code represents the population source: each family sequenced by WES has its own color described in the legend, the CEU controls are in grey. The shape of each dot represents the RA status of the represented individual.

Principal component analysis (PCA) results of samples processed for pVAAST analysis.

Whole Exome Sequencing and RA risk candidate variants detection

More than 91% of the targeted regions were sequenced with a coverage ≥ 30X. We extracted from these sequences 154,644 high-quality variants, including 64,747 (42%) in exons. We observed a mean of 26,679 exonic variants per sample and a SNV transition/transversion ratio equal to 3.01 which is consistent with previous studies [15]. Under the hypothesis that rare genetic variants linked to RA would segregate with the affected phenotype within the multiplex families, we selected in the pool of 21,507 rare variants (MAF < 1%), 2223 family-specific variants shared by all the RA-affected relatives (but absent from unaffected members). Knowing that rare variant with high predicted biological effect may contribute to the genetic predisposition of common disease, we extracted, from the 2223 variants, 77 SNVs with a high deleterious predicted effect on protein. We based our evaluation of this impact on SNPeff (“HIGH” or “MODERATE”) and CADD phred score (≥30). All these 77 rare family-specific variants were heterozygous in affected carriers. They were detected in 77 different genes, not previously associated with RA.

Evaluation of candidate variant association with RA

We reduced the set of candidate loci to 49 genes (64% of the genes tested with pVAAST) which were significantly associated with RA (pVAASTp-value < 0.041 after 106 permutations). Considering the difference in RA prevalence rate between men and women, we performed pVAAST test while controlling for the variable “sex”. A total of 43 genes still had a significant association with RA (pVAASTp-value < 0.05– see Table 2).

Table 2

Genes showing association signal in pVAAST analysis.

Gene		pVAAST result^b		Number of variants
Name	Chromosome: position^a	Without co-variable control	With sex co-variable control	Contributing to association	In the gene
SUSD5	3: 33,191,537–33,260,707	0.0003 (22.22)	0.00063 (22.22)	3	7
TMOD2	15: 52,043,758–52,108,565	0.0094 (7.07)	0.002 (7.07)	1	1
AURKB	17: 8,108,056–8,113,918	0.00053 (15.93)	0.0023 (15.93)	2	4
NEK1	4: 170,314,426–170,533,780	0.0021 (8.56)	0.0031 (8.56)	1	7
DHRS7C	17: 9,674,751–9,694,614	0.0094 (7.13)	0.0041 (7.13)	2	7
GOLGA3	12: 133,345,495–133,405,444	0.0057 (13.15)	0.0057 (13.15)	2	10
CDKN2B	9: 22,002,902–22,009,362	0.004 (8.37)	0.0057 (8.37)	1	1
CCDC189	16: 30,768,744–30,774,031	0.0032 (9.82)	0.006 (9.82)	1	1
INTS5	11: 62,414,320–62,420,774	0.031 (13.35)	0.0068 (13.35)	1	1
SUPT20H	13: 37,583,449–37,633,850	0.004 (15.15)	0.0068 (15.15)	1	3
EPB41L4A	5: 111,478,138–111,755,013	0.024 (18.19)	0.0068 (18.19)	2	8
KLHL1	13: 70,274,726–70,682,591	0.002 (12.31)	0.0071 (12.31)	2	5
SPATA13	13: 24,553,944–24,881,212	0.021 (13)	0.0087 (13)	2	12
PTK2B	8: 27,168,999–27,316,903	0.014 (12.72)	0.0097 (12.72)	2	12
NOLC1	10: 103,911,933–103,923,627	0.011 (12.89)	0.01 (12.89)	1	5
SMYD5	2: 73,441,350–73,454,365	0.016 (8.36)	0.011 (8.36)	2	3
GPR35	2: 241,544,848–241,570,676	0.011 (13.72)	0.012 (13.72)	1	12
PCCA	13: 100,741,269–101,182,686	0.036 (7.7)	0.013 (7.7)	1	3
PTGR1	9: 114,312,002–114,362,135	0.011 (7.89)	0.014 (7.89)	1	2
SBNO1	12: 123,773,656–123,849,390	0.0097 (11.01)	0.015 (11.01)	2	6
ABCC1	16: 16,043,434–16,236,931	0.014 (15.17)	0.015 (15.17)	2	8
EIF2AK2	2: 37,326,353–37,384,208	0.014 (9.01)	0.015 (9.01)	1	2
TRAK2	2: 202,241,930–202,316,302	0.014 (6.45)	0.016 (6.45)	1	6
C10orf53	10: 50,887,697–50,918,307	0.009 (13.76)	0.017 (13.76)	1	3
ADRA2B	2: 96,778,707–96,781,984	0.019 (7.03)	0.019 (7.03)	1	3
SLC9A6	X: 135067958–135129423	0.015 (6.57)	0.02 (6.57)	1	1
RSG1	1: 16,558,195–16,563,657	0.0074 (9.85)	0.023 (9.85)	1	2
ABHD5	3: 43,731,605–43,775,863	0.0094 (12.41)	0.024 (12.41)	1	3
PHOX2A	11: 71,950,121–71,956,708	0.017 (12.36)	0.026 (12.36)	1	1
CCDC24	1: 44,457,031–44,462,200	0.012 (6.41)	0.026 (6.41)	1	3
TIMM44	19: 7,991,603–8,008,805	0.015 (11.37)	0.027 (11.37)	2	8
STC2	5: 172,741,716–172,756,506	0.013 (8.74)	0.027 (8.74)	1	2
MNS1	15: 56,713,742–56,757,335	0.019 (12.17)	0.028 (12.17)	2	5
KCNA3	1: 111,214,310–111,217,655	0.036 (8.1)	0.028 (8.1)	1	1
FOXK2	17: 80,477,589–80,602,538	0.021 (11.71)	0.03 (11.71)	2	6
STRN3	14: 31,363,005–31,495,607	0.034 (5.39)	0.033 (5.39)	1	4
CIDEA	18: 12,254,318–12,277,594	0.031 (9.33)	0.034 (9.33)	1	3
VCL	10: 75,757,872–75,879,918	0.031 (9.33)	0.035 (9.33)	1	3
TUBGCP5	15: 22,833,395–22,873,892	0.016 (7.21)	0.036 (7.21)	2	3
SLC25A16	10: 70,237,756–70,287,231	0.024 (12.35)	0.037 (12.35)	1	1
GALNT10	5: 153,570,290–153,800,544	0.041 (3.93)	0.046 (3.93)	3	3
SECISBP2	9: 91,933,421–91,974,557	0.02 (7.32)	0.047 (7.32)	1	3
USE1	19: 17,326,155–17,330,638	0.026 (12.36)	0.049 (12.36)	1	2

The presented genes are ordered according to the association p-value of pVAAST test with sex co-variable control.

a: position on the human reference genome hg19

b: p-values computed from 106 permutations (score)

The presented genes are ordered according to the association p-value of pVAAST test with sex co-variable control. a: position on the human reference genome hg19 b: p-values computed from 106 permutations (score) We categorized the 43 genes into two groups by looking at the association scores of their variants (Table 3). Sixteen genes (group 1) carried several variants, including our best rare candidate SNV, which contributed to the score of the gene (SNVscore >0). In the other twenty-seven genes (group 2), only the family-specific variant participated to the gene score. In all the 43 genes, the variant with the highest score is the rare family-specific SNV that was selected previously. Thereafter, we will refer to this variant as the candidate variant. To validate the leading effect of the candidate variants, we performed again the burden test by including first, only the candidate SNVs, and then, all variants except these candidates (results in Table 3). All the 43 candidate SNVs, except SUSD5: c.526C>T (pVAASTp-value = 0.21) and SMYD5: c.662G>A (pVAASTp-value = 0.067), were significantly associated with RA (pVAASTp-value ≤ 0.037). Concerning the test without the candidate SNV, three genes in the first group (20%), SUSD5, SMYD5 and MNS1, were still significantly associated with RA (pVAASTp-value ≤ 0.036). So, several rare variants in our dataset contributed to the association of these genes with RA. However, none of the genes in the second group were associated with RA after excluding the candidate variant. This observation confirmed that the observed RA-association within these genes was in fact due to the family-specific variant.

Table 3

Association analysis of candidate SNVs in 43 candidate genes.

Gene		Candidate variant (CV)		pVAAST ^d
Name	Group^a	Chromosome: position^b	HGVS annotation^c	CV only	Without CV
TIMM44	1	19: 7997576	c.923C>A	0.0017 (11.37)	0.26 (2.58)
SLC9A6	2	X: 135122270	c.1763C>T	0.002 (6.57)	1 (0)
TMOD2	2	15: 52060595	c.263C>T	0.0023 (7.07)	1 (0)
INTS5	2	11: 62415889	c.1663C>T	0.0024 (13.35)	1 (0)
PTK2B	1	8: 27255130	c.29G>A	0.0028 (6.45)	0.17 (5.15)
EPB41L4A	1	5: 111504714	c.1828C>T	0.003 (13.76)	0.41 (4.03)
GPR35	2	2: 241569447	c.171C>G	0.0031 (13.72)	1 (0)
NEK1	2	4: 170506525	c.782G>A	0.0035 (8.6)	1 (0)
PCCA	2	13: 100921021	c.898G>A	0.0037 (7.7)	1 (0)
SUPT20H	2	13:37622040	c.73A>T	0.0038 (15.15)	1 (0)
MNS1	1	15: 56736015	c.724C>T	0.0038 (15.15)	0.036 (2.64)
DHRS7C	1	17: 9676206	c.608T>C	0.0041 (6.98)	0.41 (0.84)
NOLC1	2	10: 103917202	c.331C>T	0.0043 (12.89)	1 (0)
RSG1	2	1: 16558713	c.607C>T	0.0044 (10.72)	0.93 (0.01)
ABHD5	2	3: 43753284	c.590G>C	0.0046 (12.41)	1 (0)
AURKB	1	17: 8109861	c.637G>C	0.0051 (10.39)	0.14 (3.54)
CCDC189	2	16:30768844	c.949C>T	0.0051 (9.82)	1 (0)
CDKN2B	2	9: 22006147	c.256G>A	0.0052 (8.37)	1 (0)
KLHL1	1	13: 70681732	c.100G>T	0.0068 (9.53)	0.11 (2.77)
PTGR1	2	9: 114345759	c.488A>T	0.0074 (7.89)	1 (0)
GOLGA3	1	12: 133363022	c.3026G>A	0.0077 (8.59)	0.24 (3.17)
TRAK2	2	2: 202252596	c.1526G>A	0.014 (6.45)	1 (0)
ABCC1	1	16: 16208739	c.3196C>T	0.016 (9.33)	0.28 (3.86)
C10orf53	2	10: 50901917	c.195C>A	0.017 (13.76)	1 (0)
STC2	2	5: 172755066	c.131G>T	0.018 (8.74)	1 (0)
CIDEA	2	18: 12262897	c.214C>T	0.018 (9.33)	1 (0)
SBNO1	1	12: 123806235	c.2170G>T	0.019 (10.21)	0.9 (0.17)
TUBGCP5	1	15: 22864369	c.2327G>A	0.019 (7.21)	1 (0)
EIF2AK2	2	2: 37347149	c.1201G>A	0.02 (9.01)	1 (0)
SPATA13	1	13: 24868977	c.3367C>T	0.02 (6.79)	0.076 (6.46)
ADRA2B	2	2: 96780655	c.1234G>A	0.021 (7.03)	1 (0)
KCNA3	2	1: 111216792	c.640C>T	0.022 (8.1)	1 (0)
USE1	2	19: 17330089	c.490C>T	0.023 (12.36)	1 (0)
VCL	2	10: 75855428	c.1558C>T	0.023 (9.33)	1 (0)
CCDC24	2	1: 44461756	c.848G>A	0.025 (6.41)	1 (0)
GALNT10	1	5: 153765901	c.967G>A	0.026 (3.93)	1 (0)
SLC25A16	2	10: 70246944	c.799C>T	0.031 (12.35)	1 (0)
PHOX2A	2	11: 71950885	c.763C>T	0.032 (12.36)	1 (0)
SECISBP2	2	9: 91954819	c.1253G>T	0.033 (7.32)	1 (0)
STRN3	2	14: 31371744	c.2135C>T	0.034 (5.39)	1 (0)
FOXK2	1	17: 80543821	c.1321C>T	0.037 (7.93)	0.17 (4.17)
SMYD5	1	2: 73450610	c.833G>A	0.067 (1.69)	0.025 (4.86)
SUSD5	1	3: 33216450	c.526C>T	0.21 (5.55)	0.0037 (14.67)

The genes are sorted according to the association p-value of the candidate variant. For each gene, the candidate variant is the one with the highest score.

a: The two groups correspond to:

1 –genes containing multiple rare variants contributing to pVAAST score

2 –genes containing one rare variant leading the association

b: position on the human reference genome hg19

c: annotation on the canonical transcript

d: p-values computed from 106 permutations (score)

The genes are sorted according to the association p-value of the candidate variant. For each gene, the candidate variant is the one with the highest score. a: The two groups correspond to: 1 –genes containing multiple rare variants contributing to pVAAST score 2 –genes containing one rare variant leading the association b: position on the human reference genome hg19 c: annotation on the canonical transcript d: p-values computed from 106 permutations (score)

Validation in extended pedigrees

We selected the top 10 RA candidate variants, with the strongest association with RA, for re-sequencing to confirm their family specificity and their co-segregation with RA (Table 4). We analyzed further 8 out of 10 variants for which we were able to produce sequences and, we validated the familial specificity for all of them except PCCA. Four variants (in SUPT20H, SLC9A6, TIMM44 and NEK1 genes) were shared by all affected members in the families who carried them. However, only one candidate variant showed a complete segregation with RA. This variant is a heterozygous non-sense SNV, c.73A>T (p.Lys25*), located in SUPT20H gene which introduces a premature codon stop at the beginning of the gene (in the fourth exon) (Fig 3). This variant, not reported to date, has been deposited in ClinVar database prior to publication.

Table 4

Top 10 RA associated SNVs and their re-sequencing results.

Gene	Chromosome	Position (bp)	Ref/Alt	rs number^a	Family	#RA cases	#Unaffected
SUPT20H	13	37622040	T/A	.	3	4/4	0/5
SLC9A6	X	135122270	C/T	.	6	4/4	1/3
TIMM44	19	7997576	G/T	rs139625465	10	2/2	1/3
NEK1	4	170506525	C/T	rs200161705	3	4/4	2/5
TMOD2	15	52060595	C/T	.	4	4/5	3/6
PCCA	13	100921021	G/A	.	317	4/40/2	2/51/6
EPB41L4A	5	111504714	G/A	rs200812889	4	2/5	2/6
GPR35	2	241569447	C/G	rs141249079	NA	NA	NA
INTS5	11	62415889	G/A	.	NA	NA	NA

a: rs number from dbsnp138

Fig 3

Gene SUPT20H and its product: p38IP.

Abbreviations: NLS: Nuclear Localization signal, PEST: Proline (P) Glutamic acid (E) Serine (S) Threonine. (A) Gene model of SUPT20H. The red arrow indicates the exon including the nonsense variant identified by WES. The exons color code refers to the protein domain it encodes for. (B) On top, p38IP protein model of 733 amino-acid and on the bottom, the positions of p38IP interactors binding sites.

Gene SUPT20H and its product: p38IP.

Genotyping of SNV c.73A>T (p.Lys25*) in family 3 and in trios

The SNV c.73A>T (p.Lys25*) carried by SUPT20H was genotyped using a customized assay and digital PCR. First, genotyping results in family 3 validated the presence of the rare A allele in affected members of the family and, its absence in unaffected members (Fig 4). Second, we investigated 188 RA patients (characteristics described in supplementary S1 Table) and 362 healthy parents from trio families using one affected member of family 3 as a positive control. No one of these samples showed the rare allele of this SNV.

Fig 4

Representation of the pedigree 3 and the variants identified in his members.

All the individuals represented here were part of the validation set but only the ones marked with a red arrow were part of the discovery set. Each of them is represented by a circle (if female) or a square (if male). Their filling color corresponds to their status: black if affected by RA or white if unaffected by the disease. The presence of a variant is represented by its HGVS id. The absence of this id indicates a homozygote genotype for the reference allele.

Representation of the pedigree 3 and the variants identified in his members.

Discussion

In this study, we identified a novel SNV associated with RA, SUPT20H: c.73A>T(p.Lys25*), by performing WES in 9 multiplex RA pedigrees associated with HLA-DRB1 SE risk alleles. This nonsense variant had a complete penetrance in a family with rather young age at RA onset (meanonset-age = 31 years old). In addition, we showed with the RVAT pVAAST, that this variant was significantly associated with RA (pVAASTp-value = 3.78*10−3). The SUPT20H gene, which has not been reported yet as a RA risk gene, is a member of the SAGA complex (Spt-Ada-Gcn5 acetyltransferase) gene family. It encodes for the p38 interacting protein (p38IP) which is predicted to contain a Nuclear Localization Signal (NLS), a PEST domain (Proline Glutamic acid Serine Threonine) and 2 serine rich domains [16]. The SNV identified in this study occurring in the NLS domain, the truncated protein that could result from this variation would not have any functional domain. Previous studies reported p38IP interactions with 3 different proteins [16-18]. Indeed, the protein p38IP was shown to bind to and stabilize the protein GCN5 [17,19], member of the SAGA complex, and thus stabilize the complex itself [20]. In addition, in vitro analysis by Nagy and his colleagues [20] showed that this p38-GNC5 SAGA complex binds to the promotor of Endoplasmic Reticulum stress-induced genes, such as GRP78, enhancing their transcription. Interestingly, previous studies have noted the role of the GRP78 gene in the pathogenesis of RA: the expression of this gene is increased in RA fibroblast-like synoviocytes (FLS) and it promotes their proliferation [21,22]. The protein p38IP, as indicated by its name, interacts also with p38 MAPK, a key protein for the development of RA [23]. And it has previously been reported that this interaction inhibits monocyte/macrophage differentiation [19]. Macrophages, present in a high concentration in inflamed synovial membrane and cartilage/pannus junction, contributes to the maintenance of the inflammation in RA patients [24,25]. Finally p38IP interacts with the protein Atg9 and is required for its translocation from trans-Golgi system to endosome during starvation induced macro-autophagy [16,26,27]. The protein p38 MAPK, which has higher affinity for p38IP than Atg9 inhibits this interaction and thus inhibits the autophagy process. This implication of p38IP in autophagy process is another argument that highlights the interest of studying the role of p38IP in the development of RA. Indeed, different studies reviewed by Dai and Hu demonstrated the role of autophagy in RA [28]. In addition to SUPT20H gene, we identified 42 genes significantly associated with RA (p-value < 0.05). We evaluated their association with the test implemented in pVAAST. This software combines a rare variant association test (RVAT) with a linkage score, thus it takes into account for the familial structure of our data. It allowed us to provide all the variants called in our samples without any a priori filtering based on biological knowledge, avoiding loss of information. Indeed, the RVAT incorporates a score evaluating for each variant the amino acid substitution impact and phylogenetic conservation information, allowing it to be robust in the presence of neutral and common variants [12]. Indeed, depending on the chosen test, variants with opposite effects can lead to a decreasing power for the association test [29]. Among these 42 genes, we observed a variable number of variants contributing to the association signal: 16 genes with more than one contributing variant and 26 with only one. In the first group of genes, the potentially pathogenic variants covered the rare to low frequency spectrum. Three of these genes (named SMYD5, SUSD5 and MNS1), remained significantly associated with the disease without the candidate variants (pVAASTp-value < 0.036). The fact that the signal for the other genes was not significant anymore, when tested without the candidate SNV, could be explained by the lower number of contributing variants identified in these genes that reduce the RVAT power compared to genes with more causal variants [30] (Table 2). In the second group, 8 genes (CCDC189, CDKN2B, SLC9A6, PHOX2A, SLC25A16, TMOD2, INTS5 and KCNA3) contained only one variant. So, as expected, the score of the association test with all identified variants (familial data and control data) and with the candidate variant only were identical (see Table 1). We can highlight two limitations in this association analysis. First, we added as controls the variants of 98 individuals extracted from the 1000 genomes project database. And, although we chose individuals from the same population as our pedigrees (European) and sequenced on the same platform (Illumina), two aspects could have introduced bias: the exome capturing kit and the variant calling workflow. The first aspect can introduce variants specific to one of the dataset because not targeted by the kit used for the other one. To partially overcome this issue, we filtered out variants from the 1000 genomes dataset present outside the boundaries of our targeted exomes. But, due to the wide range of capturing kits used to produce the data presents in the 1000 genomes database, we did not filter the variants according to the regions targeted by the 1000 genomes project. For the second aspect, it has previously been shown that despite a few variant specific to a variant calling software, the concordance between the called variants is high (92% between HaplotypeCaller [31], Samtools [32] and FreeBayes [33]) [34]. In addition, the PCA, performed on all the individuals included in the association test (Fig 2), shows that the observed genetic variability is mostly related to the intra-family specificity, not to the choice of the 98 CEU controls, comforting our choice of controls. Second, individuals of the discovery set have different clinical characteristics (such as sex, age at diagnostic, and ACPA status). These differences, if not controlled, could have introduced bias in the association analysis. Since, the sex was the sole information known for both discovery set and CEU controls, we were only able to control for this co-variable. Finally, we previously suggested that rare variations present in the genome of RA cases in addition to HLA-DRB1 SE could modulate the effect of this latter, leading to RA cases with different clinical characteristics. To test this hypothesis, we could investigate the interactions between HLA-DRB1 and new RA risk genes, such as SUPT20H. For this analysis, we would need to perform a study including both individuals carrying and not carrying HLA-DRB1 risk alleles. Further work is also needed to identify new RA risk genes that may have been missed in this study because we used stringent criteria to select the candidate SNVs. Indeed, we removed variants: with incomplete penetrance, segregating in several families and not shared by all affected relatives of a given family. In conclusion, we identified a new rare nonsense SNV, SUPT20H: c.73A>T (p.Lys25*), associated with RA, by combining linkage and association analysis with pVAAST. Neither the gene nor the variant was previously associated with the disease. But the review of the literature about SUPT20H gene and its product, p38IP, supports the idea that this gene is involved in the pathogenesis of RA. Further work, with in vitro functional studies, needs to be done to evaluate the pathogenic effect of this new SNV and to validate its role in the different processes previously described in the context of RA. In addition, further studies need to be done to validate the other RA risk loci identified in this study, in particular SMYD5, MNS1 and SUSD5 genes in which we observed an aggregation of rare variants.

Methods

Participants

We studied 16 French RA multiplex families with at least 4 affected individuals per family carrying one or two HLA-DRB1 shared epitope (SE) allele. Half of the families had only RA cases, whereas the other half had RA and/or other Autoimmune Diseases (AID: Lupus Erythematosus, Vitiligo, Sjögren syndrome, Hashimoto's thyroiditis) cases. Among the 110 recruited individuals within these families, 33 (30%) were reported as only affected by RA, 17 (15%) by RA and another AID, 15 (14%) by an AID different from RA and 45 (41%) were reported as unaffected. As discovery set, we selected 30 individuals belonging to 9 of the 16 previously described pedigrees, with a sufficient quantity of DNA for WES. This sample set consisted in 19 individuals with RA and 11 relatives not affected by RA (at least one per pedigree). All the remaining samples were included in the validation set. Our study was approved by ethics committees of Hôpital Bicêtre and Hôpital Saint Louis (Paris, France; CPPRB 94–40). Everyone provided a written informed consent for the participation in the study. For statistical analysis with the Rare Variant Association Test (RVAT), we added 98 healthy controls extracted from the 1000 genomes project database. We chose CEU individuals (Utah residents with Northern and Western European ancestry) for which whole exome sequences were generated using an Illumina platform.

Whole Exome Sequencing and variant calling

The exons were captured in the discovery set with Agilent SureSelect Human All Exon kit (V5) which target the exome of more than 20,000 genes. Then, we sequenced those regions on an Illumina HiSeq2000 platform and mapped the reads to the human reference genome hg19 [35] using BWA-MEM algorithm [36]. Finally, we marked and removed the duplicates with Picard toolkit [37]. The observed variance being low, we did not apply the recalibration to the WES data. We called the variants using Haplotype Caller (HC) algorithm from the GATK suite [31,38] in the targeted regions plus 150 bp up and downstream. Then, we filtered out SNVs and small indels (maximum length of 50 bp) according to the following criteria: total read depth DP < 12, mapping quality MQ < 30, variant confidence QD < 2 and strand bias FS score > 25. We annotated genotypes with an individual DP < 10 as missing before filtering out variants with call-rate < 95% with Plink1.9 [39]. Finally, we removed variants located in segmental duplications [40] and repeated regions, such as described in RepeatMasker from UCSC [41].

Variant annotation and classification

We classified the remaining variants into rare variants (MAF < 1%), low frequency variants (1% ≤ MAF < 5%) and common variants (MAF ≥ 5%) according to their frequency in public databases. To evaluate these frequencies, we worked on 4 datasets extracted from: (1) the 1000 Genomes Project (2015 August); (2) the Exome Aggregation Consortium project (ExAC); (3) the Exome Sequencing project (ESP6500–6500 exomes); (4) the Complete Genomics project (CG69–69 individuals). We extracted the allele frequency in population with European ancestry when the information was available. To investigate the predicted effect of the genetic variants on proteins, we annotated them with CADD phred-like score [42] using ANNOVAR [43] and with variant effect defined in SNPeff [44]. The former is a score obtained from a model trained to separate evolutionary conserved variants, likely deleterious, from simulated variants, likely benign. And the latter evaluate the putative variant impact at the transcript level by using sequence ontology.

Principal Component Analysis of CEU controls and discovery set

To identify possible genetic population stratification between the discovery set and the 98 CEU controls, we performed a principal component analysis (PCA). For this analysis, we used a subset of variants located on autosomes classified as neutrals and frequents. We defined as neutral a variant annotated “LOW” by SNPEff and with a CADD phred-like score < 5. Those variants were considered frequent if their MAF was superior or equal to 5%. In addition, we processed the selected variants prior to performing the PCA analysis to remove those not in Hardy-Weinberg equilibrium (HW) and those in Linkage Disequilibrium (LD) with each other. We used Plink1.9 [39,45] to remove variants with HWp-value < 0.01 and/or with LD ≥ 0.2. The Hardy-Weinberg equilibrium was evaluated on unaffected individuals only. We finally applied the PCA on these variations with the R package SNPRELATE [46].

Selection of RA risk candidate variants

We first selected the rare variants (MAF ≤ 1%) observed in only one family with an in-house python script. Further, we chose variants segregating with RA within families and being absent from healthy relatives.

Association analysis of selected genes with burden test

We tested the association of genes carrying at least one rare candidate variant using pVAAST [12]. This software extends the rare variant association test (RVAT) VAAST [47] to offer more power in the context of family-based studies. It provides a linkage-association score and a p-value for each evaluated gene and, a score for each variant carried by the gene to help the prioritization of these variants. Considering the heterozygous nature of our candidate SNVs, we tested the candidate genes under a dominant model and set the maximum disease prevalence to 0.01, the world-wide prevalence of RA [48]. In addition, we provided to pVAAST a file containing for each variant included in the analysis the maximum frequency observed in the four public databases described above. To evaluate the significance of the test, we authorized genotyping error and did not restrain penetrance value for gene-drop simulations. We estimated p-values by allowing pVAAST to perform up-to 106 permutations. We ran a first time pVAAST for each of our candidate genes by including all the variants, rare and frequent, identified in our discovery set and/or in the 98 controls from 1000 genomes project). The genes significantly associated with RA (p-valuerun1 < 0.05) were selected for a second run of pVAAST using the same parameters plus a control of the sex co-variable. Then, for each gene with p-valuerun2 < 0.05, we selected the variant with the highest score. We ran pVAAST on this variant alone (run 3) and then on the gene without this variant (run 4).

Validation in extended pedigrees of exome selected variants

The top 10 candidate variations, selected with the lowest p-values in the 3rd run of the pVAAST analysis, were re-sequenced in the validation set by using PGM System (Ion Personal Genome Machine). Primers were designed with the Ion AmpliSeq Designer to target the selected variants. We mapped the reads to the reference genome with BWA [36] and recalibrated base score with BQSR tool from GATK suite [31]. We then called variants in the strict limits of the targeted regions with HC and applied the same quality filters used for WES to the observed variants. For samples in both discovery and validation sets, we checked the concordance with whole exome data using VCFtools program package [49].

Genotyping of SUPT20H: c.73A>T in multiplex families and additional trios

The candidate rare variant was also genotyped using a custom assay with a FAM or VIC reporter Dye at the 5’end of each TaqMan MGB probe and a non-fluorescent quencher at the 3’end of each probe (Applied Biosystems, Foster City, CA, USA). Digital PCR (QX200 Droplet Digital PCR, Bio-Rad Laboratories, California, USA) was used to detect variant alleles. First, we analyzed members of the family in which the variant was characterized. Second, an independent sample of 188 trio families (one RA patient and his/her two parents from French European origin) was investigated to search for the variant, along with a positive and a negative control.

Pedigrees of the 16 families included in this study.

(XLSX) Click here for additional data file.

Candidate gene variant calling file.

VCF file containing all the high-quality variants identified in the 77 candidate genes. (VCF) Click here for additional data file.

Unfiltered pVAAST results.

(XLSX) Click here for additional data file.

Characteristics of 188 RA index cases in which the SNV c.73A>T carried by SUPT20H was investigated.

Number of index cases / Number of index cases with data previous and/or actual tobacco exposure (smokers and ex-smokers). RF: Rheumatoid Factor. ACPA: Anti-Cyclic Citrullinated Peptide Antibodies. (DOCX) Click here for additional data file.

44 in total

1. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; C Nusbaum; M C Zody; J Baldwin; K Devon; K Dewar; M Doyle; W FitzHugh; R Funke; D Gage; K Harris; A Heaford; J Howland; L Kann; J Lehoczky; R LeVine; P McEwan; K McKernan; J Meldrim; J P Mesirov; C Miranda; W Morris; J Naylor; C Raymond; M Rosetti; R Santos; A Sheridan; C Sougnez; Y Stange-Thomann; N Stojanovic; A Subramanian; D Wyman; J Rogers; J Sulston; R Ainscough; S Beck; D Bentley; J Burton; C Clee; N Carter; A Coulson; R Deadman; P Deloukas; A Dunham; I Dunham; R Durbin; L French; D Grafham; S Gregory; T Hubbard; S Humphray; A Hunt; M Jones; C Lloyd; A McMurray; L Matthews; S Mercer; S Milne; J C Mullikin; A Mungall; R Plumb; M Ross; R Shownkeen; S Sims; R H Waterston; R K Wilson; L W Hillier; J D McPherson; M A Marra; E R Mardis; L A Fulton; A T Chinwalla; K H Pepin; W R Gish; S L Chissoe; M C Wendl; K D Delehaunty; T L Miner; A Delehaunty; J B Kramer; L L Cook; R S Fulton; D L Johnson; P J Minx; S W Clifton; T Hawkins; E Branscomb; P Predki; P Richardson; S Wenning; T Slezak; N Doggett; J F Cheng; A Olsen; S Lucas; C Elkin; E Uberbacher; M Frazier; R A Gibbs; D M Muzny; S E Scherer; J B Bouck; E J Sodergren; K C Worley; C M Rives; J H Gorrell; M L Metzker; S L Naylor; R S Kucherlapati; D L Nelson; G M Weinstock; Y Sakaki; A Fujiyama; M Hattori; T Yada; A Toyoda; T Itoh; C Kawagoe; H Watanabe; Y Totoki; T Taylor; J Weissenbach; R Heilig; W Saurin; F Artiguenave; P Brottier; T Bruls; E Pelletier; C Robert; P Wincker; D R Smith; L Doucette-Stamm; M Rubenfield; K Weinstock; H M Lee; J Dubois; A Rosenthal; M Platzer; G Nyakatura; S Taudien; A Rump; H Yang; J Yu; J Wang; G Huang; J Gu; L Hood; L Rowen; A Madan; S Qin; R W Davis; N A Federspiel; A P Abola; M J Proctor; R M Myers; J Schmutz; M Dickson; J Grimwood; D R Cox; M V Olson; R Kaul; C Raymond; N Shimizu; K Kawasaki; S Minoshima; G A Evans; M Athanasiou; R Schultz; B A Roe; F Chen; H Pan; J Ramser; H Lehrach; R Reinhardt; W R McCombie; M de la Bastide; N Dedhia; H Blöcker; K Hornischer; G Nordsiek; R Agarwala; L Aravind; J A Bailey; A Bateman; S Batzoglou; E Birney; P Bork; D G Brown; C B Burge; L Cerutti; H C Chen; D Church; M Clamp; R R Copley; T Doerks; S R Eddy; E E Eichler; T S Furey; J Galagan; J G Gilbert; C Harmon; Y Hayashizaki; D Haussler; H Hermjakob; K Hokamp; W Jang; L S Johnson; T A Jones; S Kasif; A Kaspryzk; S Kennedy; W J Kent; P Kitts; E V Koonin; I Korf; D Kulp; D Lancet; T M Lowe; A McLysaght; T Mikkelsen; J V Moran; N Mulder; V J Pollara; C P Ponting; G Schuler; J Schultz; G Slater; A F Smit; E Stupka; J Szustakowki; D Thierry-Mieg; J Thierry-Mieg; L Wagner; J Wallis; R Wheeler; A Williams; Y I Wolf; K H Wolfe; S P Yang; R F Yeh; F Collins; M S Guyer; J Peterson; A Felsenfeld; K A Wetterstrand; A Patrinos; M J Morgan; P de Jong; J J Catanese; K Osoegawa; H Shizuya; S Choi; Y J Chen; J Szustakowki
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

2. p38 and a p38-interacting protein are critical for downregulation of E-cadherin during mouse gastrulation.

Authors: Irene E Zohn; Yingqiu Li; Edward Y Skolnik; Kathryn V Anderson; Jiahuai Han; Lee Niswander
Journal: Cell Date: 2006-06-02 Impact factor: 41.582

Review 3. New insights into the function of Atg9.

Authors: Jemma L Webber; Sharon A Tooze
Journal: FEBS Lett Date: 2010-01-17 Impact factor: 4.124

4. Regulation of autophagy by p38alpha MAPK.

Authors: Jemma L Webber
Journal: Autophagy Date: 2010-02-06 Impact factor: 16.016

Review 5. Genetics of rheumatoid arthritis - a comprehensive review.

Authors: Júlia Kurkó; Timea Besenyei; Judit Laki; Tibor T Glant; Katalin Mikecz; Zoltán Szekanecz
Journal: Clin Rev Allergy Immunol Date: 2013-10 Impact factor: 8.667

6. Coordinated regulation of autophagy by p38alpha MAPK through mAtg9 and p38IP.

Authors: Jemma L Webber; Sharon A Tooze
Journal: EMBO J Date: 2009-11-05 Impact factor: 11.598

Review 7. Epidemiology and genetics of rheumatoid arthritis.

Authors: Alan J Silman; Jacqueline E Pearson
Journal: Arthritis Res Date: 2002-05-09

8. Cells of the synovium in rheumatoid arthritis. Macrophages.

Authors: Raimund W Kinne; Bruno Stuhlmüller; Gerd-R Burmester
Journal: Arthritis Res Ther Date: 2007 Impact factor: 5.156

9. The UCSC Genome Browser database: 2014 update.

Authors: Donna Karolchik; Galt P Barber; Jonathan Casper; Hiram Clawson; Melissa S Cline; Mark Diekhans; Timothy R Dreszer; Pauline A Fujita; Luvina Guruvadoo; Maximilian Haeussler; Rachel A Harte; Steve Heitner; Angie S Hinrichs; Katrina Learned; Brian T Lee; Chin H Li; Brian J Raney; Brooke Rhead; Kate R Rosenbloom; Cricket A Sloan; Matthew L Speir; Ann S Zweig; David Haussler; Robert M Kuhn; W James Kent
Journal: Nucleic Acids Res Date: 2013-11-21 Impact factor: 16.971

10. A novel miR-200b-3p/p38IP pair regulates monocyte/macrophage differentiation.

Authors: Xiao Yu; Qi-Long Wang; Yue-Fang Li; Xu-Dong Wang; Anlong Xu; Yingqiu Li
Journal: Cell Discov Date: 2016-01-26 Impact factor: 10.849

3 in total

1. Genome-wide association study of aromatase inhibitor discontinuation due to musculoskeletal symptoms.

Authors: Daniel L Hertz; Julie A Douglas; Robert M Miller; Kelley M Kidwell; Christina L Gersch; Zeruesenay Desta; Anna Maria Storniolo; Vered Stearns; Todd C Skaar; Daniel F Hayes; N Lynn Henry; James M Rae
Journal: Support Care Cancer Date: 2022-07-01 Impact factor: 3.359

2. The p38-interacting protein p38IP suppresses TCR and LPS signaling by targeting TAK1.

Authors: Xu-Dong Wang; Chen-Si Zhao; Qi-Long Wang; Qi Zeng; Xing-Zhi Feng; Lianbo Li; Zhi-Long Chen; Yu Gong; Jiahuai Han; Yingqiu Li
Journal: EMBO Rep Date: 2020-05-15 Impact factor: 8.807

3. Identification of autophagy-related risk signatures for the prognosis, diagnosis, and targeted therapy in cervical cancer.

Authors: Dan Meng; Hua Jin; Yun Gu; Shizhi Wang; Xing Zhang; Wenjing Yan; Qianqian Xia; Siyuan Shen; Shuqian Xie; Mengjing Cui; Bo Ding
Journal: Cancer Cell Int Date: 2021-07-08 Impact factor: 5.722

3 in total