
Studying the effects of haplotype partitioning methods on the RA-associated genomic results from the North American Rheumatoid Arthritis Consortium (NARAC) dataset.

Mohamed N Saad¹, Mai S Mabrouk², Ayman M Eldeib³, Olfat G Shaker⁴.

Abstract

The human genome, which includes thousands of genes, represents a big data challenge. Rheumatoid arthritis (RA) is a complex autoimmune disease with a genetic basis. Many single-nucleotide polymorphism (SNP) association methods partition a genome into haplotype blocks. The aim of this genome-wide association study (GWAS) was to select the most appropriate haplotype block partitioning method for the North American Rheumatoid Arthritis Consortium (NARAC) dataset. The methods used for the NARAC dataset were the individual SNP approach and the following haplotype block methods: the four-gamete test (FGT), confidence interval test (CIT), and solid spine of linkage disequilibrium (SSLD). The measured parameters that reflect the strength of the association between the biomarker and RA were the P-value after Bonferroni correction and other parameters used to compare the output of each haplotype block method. This work presents a comparison among the individual SNP approach and the three haplotype block methods to select the method that can detect all the significant SNPs when applied alone. The GWAS results from the NARAC dataset obtained with the different methods are presented. The individual SNP, CIT, FGT, and SSLD methods detected 541, 1516, 1551, and 1831 RA-associated SNPs, respectively, and the individual SNP, FGT, CIT, and SSLD methods detected 65, 156, 159, and 450 significant SNPs, respectively, that were not detected by the other methods. Three hundred eighty-three SNPs were discovered by all four methods (the three haplotype block methods and the individual SNP approach), while 1021 SNPs were discovered by all three haplotype block methods. The 383 SNPs detected by all the methods are promising candidates for studying RA susceptibility. A hybrid technique involving all four methods should be applied to detect the significant SNPs associated with RA in the NARAC dataset, but the SSLD method may be preferred because of its advantages when only one method is used.


Keywords: Confidence interval test; Four-gamete test; Genome-wide association study; NARAC; Rheumatoid arthritis; Solid spine of linkage disequilibrium

Year: 2019 PMID: 30891314 PMCID: PMC6403413 DOI: 10.1016/j.jare.2019.01.006

Source DB: PubMed Journal: J Adv Res ISSN: 2090-1224 Impact factor: 10.479

Introduction

RA, a chronic autoimmune disease that affects the body’s joints and bones, is considered to have a genetic basis. Genetic association studies are used to detect RA biomarkers, and SNPs are used as biomarkers for detecting RA. The number of these nucleotide polymorphisms is larger in RA patients than in healthy controls. These SNPs are in or near genes that commonly play a role in immunity. Most of these genes are linked to RA pathogenesis [1], [2], [3], [4]. The rapid progress in genotyping technologies has resulted in an ever-increasing volume of genotyped SNPs, which has led to advances in the understanding of complex diseases (such as RA) and represents a challenge for the future [5]. Single SNP methods are the main techniques used to identify RA biomarkers. Recently, the ability to obtain a high genomic density of SNPs (representing big data) has led to the application of haplotype block methods. These methods are applied to discover RA associations with a block rather than with a single SNP. A haplotype block consists of nearby SNPs that have high inter-relationships with one another. The parameter representing these relationships is linkage disequilibrium (LD) [6], [7], [8]. The objective of the present work was to apply the individual SNP approach and three haplotype block methods to the NARAC dataset to identify RA biomarkers through a GWAS [9]. GWAS results represent a domain of big data with millions of SNPs tested against many phenotypes. These results have become a burden for bioinformaticians in terms of processing time and real-time visualization [10], [11]. The applied haplotype block methods were CIT, FGT, and SSLD. P-values were calculated to measure the strength of association between the genetic variants and RA susceptibility, and significance was assessed after stringent Bonferroni correction for multiple comparisons (a threshold of 0.05 divided by the number of comparisons) [12]. In addition, the block size (in base pairs (bp) and in number of included SNPs), number of blocks, percentage of SNPs not covered by the block method, percentage of significant blocks in the total number of blocks, and number of significant haplotypes and SNPs were compared among the three haplotype block methods.

Material and methods

Study population

The NARAC dataset consisted of 2062 participants (1493 female and 569 male), grouped into 868 RA patients and 1194 healthy controls. All cases and controls were Caucasian [13]. The studied genetic variants were 545,080 SNPs distributed across the whole genome. Because allosomes (sex chromosomes (Chrs)) were outside of this research focus, 531,689 SNPs were retained for the study. After removing 22,276 SNPs that met at least one of the following exclusion criteria, 509,413 SNPs remained for further analysis: a genotype match below 75% [14], a Hardy-Weinberg equilibrium (HWE) P-value below 0.001 [15], or a minor allele frequency (MAF) below 0.001 in the total sample [16]. The NARAC dataset represents a big data challenge because of its size and complexity. One way to handle such a challenge is to place the raw GWAS data for every Chr into a separate file. Then, each file is processed using GWAS software. Finally, the results for all the Chrs are merged together. A snapshot of the NARAC (raw) dataset is shown in Fig. 1.

Fig. 1

Snapshot of the NARAC dataset showing 10 samples with their corresponding 3 SNPs. The first column represents the individuals’ IDs. The second column refers to the affection status (0: case, 1: control). The third column shows the sex (F: female, M: male). The next columns correspond to the SNPs, with the first row providing the SNP ID. In each SNP cell, two identical alleles represent a homozygote, whereas two different alleles represent a heterozygote.
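The three exclusion criteria listed in the study population paragraph can be sketched as a simple per-SNP filter. This is a minimal illustration, not the pipeline used in the study: the `SnpSummary` container and the example values are hypothetical, and reading "genotype match" as the genotyping call rate is an assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the study population paragraph.
MIN_CALL_RATE = 0.75  # "genotype match" is read here as call rate (assumption)
MIN_HWE_P = 0.001     # Hardy-Weinberg equilibrium P-value
MIN_MAF = 0.001       # minor allele frequency in the total sample


@dataclass
class SnpSummary:  # hypothetical container, not a PLINK or Haploview type
    snp_id: str
    call_rate: float
    hwe_p: float
    maf: float


def passes_qc(snp: SnpSummary) -> bool:
    """Return True when the SNP meets none of the three exclusion criteria."""
    return (snp.call_rate >= MIN_CALL_RATE
            and snp.hwe_p >= MIN_HWE_P
            and snp.maf >= MIN_MAF)


# Made-up summaries, for illustration only.
snps = [
    SnpSummary("snp_a", 0.99, 0.40, 0.20),
    SnpSummary("snp_b", 0.60, 0.40, 0.20),    # low call rate
    SnpSummary("snp_c", 0.99, 0.0001, 0.20),  # fails HWE
    SnpSummary("snp_d", 0.99, 0.40, 0.0005),  # too rare
]
kept = [s.snp_id for s in snps if passes_qc(s)]

# The counts reported in the text are consistent with each other.
autosomal_snps = 531_689
removed_snps = 22_276
remaining_snps = autosomal_snps - removed_snps
```

Only `snp_a` survives the filter in this toy input, and the subtraction gives the 509,413 SNPs reported for further analysis.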

Material

For the NARAC dataset, each Chr data file was extracted from the NARAC data file using the programming language Perl. All Chr data files were reformatted in the statistical package R 3.1.0 for processing by the program PLINK. The R language was also used to extract the map file of each Chr from the NARAC map file (SNP ID, physical position, and Chr number). The reformatted data and map files of each Chr were processed by PLINK 1.07 and gPLINK 2.05 in preparation for processing by the program Haploview 4.2 [17]. Haploview 4.2 was used to partition all the Chrs into successive blocks using the CIT, FGT, and SSLD methods; to calculate the corresponding P-values for each haplotype in each block; to apply the individual SNP approach; to calculate the corresponding P-value for each SNP; and to display the LD results [18]. The default parameters for the three haplotype block methods were used. The RA-associated SNPs determined by using the individual SNP approach were highlighted on a Manhattan plot generated using R [19]. The significant blocks and the associated SNPs were selected using MATLAB release 2010a. Fig. 2 shows a block diagram of the entire association analysis. The DAVID (database for annotation, visualization and integrated discovery) bioinformatics resources 6.8 were used to perform a functional pathway analysis and a disease enrichment analysis [20], [21].

Fig. 2

Summary of the proposed system for the NARAC dataset.
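The split-by-Chr, process, then merge workflow described above can be sketched as follows. The study used Perl, R, PLINK, and Haploview for these steps; the Python below is only an illustration of the bookkeeping. The function names are hypothetical, the per-Chr analysis is replaced by a stand-in, and the three map records reuse SNP IDs and positions from Table 3.

```python
from collections import defaultdict


def group_snps_by_chr(map_records):
    """Group SNP IDs by Chr, sorted by physical position within each Chr."""
    by_chr = defaultdict(list)
    for snp_id, position, chrom in map_records:
        by_chr[chrom].append((position, snp_id))
    return {chrom: [snp for _, snp in sorted(entries)]
            for chrom, entries in by_chr.items()}


def merge_results(per_chr_results):
    """Concatenate per-Chr result lists in Chr order."""
    merged = []
    for chrom in sorted(per_chr_results):
        merged.extend(per_chr_results[chrom])
    return merged


# Map records: (SNP ID, physical position in bp, Chr number).
map_records = [
    ("rs2476601", 114_089_610, 1),
    ("rs3761847", 120_769_793, 9),
    ("rs2493291", 3_352_541, 1),
]
snps_by_chr = group_snps_by_chr(map_records)

# Stand-in for the per-Chr association analysis (hypothetical).
per_chr_results = {chrom: [(chrom, snp) for snp in snps]
                   for chrom, snps in snps_by_chr.items()}
all_results = merge_results(per_chr_results)
```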

Testing for associations with RA susceptibility

Both individual SNP associations and haplotype associations were measured with the aid of P-values. Statistically significant SNPs were detected using their corresponding P-values after stringent Bonferroni correction for multiple comparisons (less than 0.05 per the number of comparisons).
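As a minimal sketch of the Bonferroni rule stated above, the threshold is the nominal level divided by the number of comparisons. Using the 509,413 retained SNPs as the number of comparisons for the individual SNP approach is an assumption made here for illustration; the thresholds actually applied are those listed in Table S2.

```python
def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance level after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be a positive integer")
    return alpha / n_tests


def is_significant(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True when the P-value falls below the corrected threshold."""
    return p_value < bonferroni_threshold(n_tests, alpha)


# Assumption: one test per SNP retained after quality control.
n_snps = 509_413
threshold = bonferroni_threshold(n_snps)  # a little under 1e-7
```

Because the threshold shrinks as the number of tests grows, a method that runs fewer tests (for example, one test per haplotype rather than per SNP) faces a less severe correction.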

Results

Four methods were applied to the NARAC dataset: the individual SNP approach and three haplotype block methods (FGT, CIT, and SSLD). The measured parameter was the P-value after Bonferroni correction. The three haplotype block methods were compared on the basis of the block size (in bp or number of SNPs), number of blocks, percentage of uncovered SNPs, percentage of significant blocks, percentage of significant haplotypes, and number of associated SNPs. The test algorithms were run on an Intel Core i7-4720HQ 2.6 GHz system with 16 GB of RAM. Table S1 lists the processing time for each program. The total working time for all Chrs was 3353 min (approximately 56 h). Table S2 shows the significance level after Bonferroni correction for multiple comparisons (0.05/total number of comparisons). The results related to the haplotype block methods are shown in Tables S3–S24. FGT partitioned the twenty-two Chrs into more blocks (99,856 blocks) than CIT (93,422 blocks) and SSLD (86,179 blocks). On average, the SSLD blocks included more SNPs per Chr (5 SNPs) than FGT (4 SNPs) and CIT (3 SNPs). As shown in Table 1, the median block size per Chr was larger for SSLD (12,046 bp) than for FGT (8328 bp) and CIT (7368 bp), confirming the greater genomic coverage by SSLD blocks. These results were checked for significance using the Kruskal–Wallis test by ranks, which showed a statistically significant difference in the distribution of the median block size among the three methods (P-value = 1.39 × 10⁻⁹). Using the Wilcoxon rank sum test, the differences between FGT and SSLD, between CIT and SSLD, and between CIT and FGT were statistically significant (P-values = 1.986 × 10⁻⁷, 1.515 × 10⁻⁸, and 0.009, respectively).

Table 1

Median block size (in bp) obtained with the three block methods for all blocks (General) and for the blocks significantly associated with RA (Significant).

Chr no.	CIT (General)	FGT (General)	SSLD (General)	CIT (Significant)	FGT (Significant)	SSLD (Significant)
1	8489	9547	13,549	64,634	47,700	34,467
2	8495	9645	14,342	24,123	11,756	23,312
3	7938	9240	13,544	7513	11,854	13,800
4	9947	11,083	13,544	3279	3279	0
5	8641	9697	14,102	22,052	15,381	18,456
6	8457	9583	13,944	8672	7448	10,123
7	8235	9008	13,869	27,949	4326	32,616
8	7149	7971	12,262	15,280	14,404	10,115
9	6324	7166	10,297	10,662	15,473	13,315
10	7464	8392	12,231	2462	669	9719
11	7764	8634	12,455	9746	9504	0
12	8043	8898	13,281	5705	5705	10,091
13	8346	9134	13,410	9913	4663	32,705
14	7458	8443	12,747	18,225	12,316	18,225
15	6151	7336	10,451	9321	11,213	14,822
16	4912	5562	8984	24,155	6893	64,712
17	6263	7535	9997	12,690	57,213	18,594
18	6811	7962	11,379	0	8210	11,265
19	6760	7930	10,833	9571	10,633	18,621
20	6413	6933	10,563	7448	6133	21,323
21	6784	7552	10,871	13,020	11,817	4704
22	5272	5986	8381	9298	10,650	24,936
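The Kruskal–Wallis comparison of the "General" columns of Table 1 can be reproduced with a short standard-library sketch. It assumes the test was run on the 22 per-Chr medians of each method and uses the fact that, for three groups, the chi-squared approximation with 2 degrees of freedom has the closed-form tail probability exp(−H/2). The function names are illustrative.

```python
import math
from collections import Counter

# Median block size per Chr (bp), "General" columns of Table 1.
CIT = [8489, 8495, 7938, 9947, 8641, 8457, 8235, 7149, 6324, 7464, 7764,
       8043, 8346, 7458, 6151, 4912, 6263, 6811, 6760, 6413, 6784, 5272]
FGT = [9547, 9645, 9240, 11083, 9697, 9583, 9008, 7971, 7166, 8392, 8634,
       8898, 9134, 8443, 7336, 5562, 7535, 7962, 7930, 6933, 7552, 5986]
SSLD = [13549, 14342, 13544, 13544, 14102, 13944, 13869, 12262, 10297, 12231,
        12455, 13281, 13410, 12747, 10451, 8984, 9997, 11379, 10833, 10563,
        10871, 8381]


def average_ranks(values):
    """Rank values from 1 to N, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis_three_groups(a, b, c):
    """Return (H, P-value) for three groups; the P-value formula needs df = 2."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Standard correction for tied observations.
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    p_value = math.exp(-h / 2)  # chi-squared survival function for df = 2
    return h, p_value


h_stat, p_value = kruskal_wallis_three_groups(CIT, FGT, SSLD)
```

On these 66 values the sketch gives H of about 40.8 and a P-value of about 1.39 × 10⁻⁹, in line with the value reported in the text.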

Although SSLD produced the lowest number of blocks, its larger median block size and median number of SNPs per block meant that 95.68% of the genotyped SNPs were localized with SSLD, compared to 87.74% with FGT and 77.88% with CIT. Accordingly, the density of the genotyped SNPs was sufficient for haplotype association mapping. The lowest number of studied SNPs needed for GWASs is 100,000 [15], which was attained by the four methods. Considerable variation in the haplotype block structure across the twenty-two Chrs was uncovered, with block sizes ranging from 2 bp (for the three methods) to 498,545 bp for FGT, 498,091 bp for SSLD, and 499,937 bp for CIT. FGT generated more significant haplotypes (437 haplotypes) than CIT (396 haplotypes) and SSLD (383 haplotypes) for the twenty-two Chrs. As shown in Tables S3–S24, the average percentage of significant blocks in the total number of blocks per Chr was higher for FGT (0.248%) than for CIT (0.241%) and SSLD (0.226%). Fig. 3 shows the significant blocks obtained with the three haplotype block methods for the twenty-two Chrs. For each Chr, the total number of significant blocks, the total number of associated SNPs, and the total sizes of the significant blocks (in bp) are shown in Fig. 3a–c, respectively.

Fig. 3

Comparison of the RA-associated results obtained by the three haplotype block partitioning methods. (a) The total number of significant blocks for each Chr. (b) The total number of associated SNPs for each Chr. (c) The total size of the significant blocks in bp for each Chr.

On average, the significant SSLD blocks included more SNPs per Chr (6 SNPs) than the significant FGT (4 SNPs) and CIT (4 SNPs) blocks. The median significant block size for the twenty-two Chrs was larger for SSLD (32,550 bp) than for CIT (14,350 bp) and FGT (13,055 bp). These results were checked for significance using the Kruskal–Wallis test by ranks. The difference among the three groups was not statistically significant (P-value = 0.077). The minimum significant block size for the twenty-two Chrs was larger for SSLD (52 bp for Chr 8) than for FGT (26 bp for Chr 6) and CIT (15 bp for Chr 11). The maximum significant block size was larger for SSLD (344,667 bp for Chr 1) than for FGT (318,113 bp for Chr 3) and CIT (209,237 bp for Chr 6). The significant SSLD blocks included more associated SNPs (1831 SNPs) than the significant FGT (1551 SNPs) and CIT (1516 SNPs) blocks. In addition, the number of associated SNPs determined by the individual SNP approach was 541, as shown in Table 2. The number of significant SNPs discovered by only the SSLD method (450 SNPs) was greater than that by the CIT (159 SNPs), FGT (156 SNPs), and individual SNP (65 SNPs) methods, as shown in Fig. 4.

Table 2

Results of the individual SNP approach compared to all three block methods.

Chr no.	Total no. of significant SNPs obtained by the individual SNP method	No. of significant SNPs obtained by only the individual SNP method	No. of significant SNPs obtained by all three block methods	No. of significant SNPs obtained by all four methods
1	4	3	8	1
2	2	2	0	0
3	5	3	7	0
4	5	2	0	0
5	6	4	8	2
6	432	12	916	367
7	7	3	2	0
8	11	3	14	1
9	11	4	16	7
10	5	2	0	0
11	2	1	0	0
12	3	1	6	0
13	0	0	0	0
14	5	2	11	1
15	3	3	5	0
16	7	5	0	0
17	4	2	11	0
18	3	1	0	0
19	5	2	13	2
20	8	5	3	1
21	7	2	0	0
22	6	3	1	1

Fig. 4

Number of RA biomarkers detected by each method – “all” biomarkers detected by the method or detected “only” by one method.
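The "only" and "all four methods" counts in Table 2 and Fig. 4 are set operations on the lists of significant SNPs from each method. A minimal sketch, using made-up SNP IDs rather than the study's hit lists, is shown here; the function name is illustrative.

```python
def exclusive_and_shared(hits_by_method):
    """Split hits into those unique to each method and those found by all.

    hits_by_method maps a method name to the set of significant SNP IDs.
    """
    exclusive = {}
    for name, hits in hits_by_method.items():
        others = set().union(
            *(h for n, h in hits_by_method.items() if n != name))
        exclusive[name] = hits - others
    shared = set.intersection(*hits_by_method.values())
    return exclusive, shared


# Toy input with hypothetical SNP IDs.
hits = {
    "individual": {"s1", "s2", "s5"},
    "CIT": {"s1", "s3"},
    "FGT": {"s1", "s3", "s4"},
    "SSLD": {"s1", "s4", "s6", "s7"},
}
exclusive, shared = exclusive_and_shared(hits)
```

In this toy input, `s1` is the only SNP found by all four methods, while CIT and FGT have no exclusive hits because each of their SNPs is also reported by another method.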

Fig. 5 shows the associations across the entire genome, illustrating the big data challenge. The alternating colours (blue and red) distinguish between the end of one Chr and the start of the next Chr. The lower horizontal line in Fig. 5 represents the threshold for suggestive associations (−log10(10⁻⁵)), while the higher line represents the genome-wide significance threshold (−log10(5 × 10⁻⁸)). The associated SNPs are highlighted in green. As expected, most of the associated SNPs on Chr 6 showed highly significant associations with RA susceptibility (P-values < 0.0001). In contrast, none of the SNPs on Chr 13 showed any association with RA. Chr 6 contained most of the known genetic biomarkers for RA. The top SNP (rs660895) in the human leukocyte antigen (HLA) region (32,685,358 bp), representing HLA-DRB1/HLA-DQA1, had the lowest P-value (1.03 × 10⁻¹¹³), as previously reported [22], [23], [24], [25].

Fig. 5

Manhattan plot showing the associations between the whole NARAC SNPs and RA susceptibility using the individual SNP approach. The genes with P-values lower than the genome-wide significance threshold are shown above the plot area.
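The two horizontal lines described for Fig. 5 are the P-value thresholds expressed on the −log10 scale used for the vertical axis of a Manhattan plot. The short sketch below converts the thresholds and the top P-value quoted in the text; the `classify` helper is hypothetical.

```python
import math

SUGGESTIVE_P = 1e-5   # lower line in Fig. 5
GENOME_WIDE_P = 5e-8  # upper line in Fig. 5


def neg_log10(p_value: float) -> float:
    """Height of a P-value on the Manhattan plot's vertical axis."""
    return -math.log10(p_value)


def classify(p_value: float) -> str:
    """Label a P-value relative to the two thresholds."""
    if p_value < GENOME_WIDE_P:
        return "genome-wide"
    if p_value < SUGGESTIVE_P:
        return "suggestive"
    return "none"


suggestive_line = neg_log10(SUGGESTIVE_P)    # 5
genome_wide_line = neg_log10(GENOME_WIDE_P)  # about 7.3
top_snp_height = neg_log10(1.03e-113)        # rs660895, about 113
```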

Discussion

In this study, 509,413 SNPs were used to test the association with RA susceptibility in the NARAC dataset. The examined SNPs belonged to twenty-two autosomes, providing a large data domain. The surveyed SNPs of the NARAC dataset were dense enough for examination by haplotype block methods. Four methods were applied to assign the associations (CIT, FGT, SSLD, and the individual SNP approach). The aim was to test the NARAC dataset to determine whether haplotype block methods or a single-locus approach alone can sufficiently identify the significant biomarkers associated with RA. This research could not identify a single best method because each method produced significant findings that were not detected using any of the other methods. The individual SNP, CIT, FGT, and SSLD methods exclusively detected 65, 159, 156, and 450 SNPs, respectively. Table S25 shows the SNP IDs that were uniquely identified by each method. These findings are in line with the conclusion of Shim et al. (who did not test the SSLD method) that both the individual SNP approach and the haplotype block methods should be applied to discover valuable associations in the NARAC dataset [16]. As shown in Table 2, the 383 SNPs that were determined to be significantly associated with RA susceptibility by the individual SNP approach and the haplotype block methods represent good candidates for further investigation. In addition, 1021 RA-associated SNPs were detected by all three haplotype block methods and deserve greater attention. The SSLD method detected more significant SNPs (1831 SNPs) than the FGT (1551 SNPs), CIT (1516 SNPs), and individual SNP (541 SNPs) methods, potentially because SSLD does not consider the LD between intermediate SNPs. Therefore, the SSLD method is the least conservative in including SNPs inside the haplotype blocks. The biomarkers identified by the individual SNP approach with P-values lower than the genome-wide significance threshold (shown in Fig. 5) are given in Table 3 with their corresponding haplotype blocks. Three hundred and twenty biomarkers from Chr 6 passed the genome-wide significance threshold (data not shown). The SNPs from Chrs 11, 13, 15, 19, and 21 failed to pass the genome-wide significance threshold. Five of the seven biomarkers from Chr 9 were members of a block that was detected by all three block methods. This finding emphasized the association of the PHF19-TRAF1-C5 region with RA [26].

Table 3

The highly significant SNPs (with P-values lower than the genome-wide significance threshold) discovered by the individual SNP approach with the corresponding haplotype blocks.

SNP ID	Chr	Position (bp)	Assoc. Allele (a)	AAF (b) (Case, Control)	P-value (c)	Gene/Nearest Genes	Haplotype Block (Method, P-value (c), No. of SNPs in Block)	Haplotype Block Position (bp) (Start, End, Size)	Previously Studied in
rs2493291	1	3,352,541	G	0.956, 0.881	1.56 E-14	PRDM16	Not detected by any method	–	[28]
rs2476601	1	114,089,610	A	0.155, 0.084	1.12 E-12	PTPN22	FGT, 8.5 E-13, 8	114075501, 114132504, 57,004	[22], [24], [25], [29], [30], [31], [32], [33]
							CIT, 1.01 E-11, 10	114050631, 114141503, 90,873
							SSLD, 1.03 E-10, 33	113787838, 114132504, 344,667
rs12467084	2	37,860,221	G	0.994, 0.964	1.12 E-09	CDC42EP3/FAM82A1	Not detected by any method	–	–
rs6752643	2	198,949,233	G	0.989, 0.956	2.94 E-09	PLCL1/SATB2	Not detected by any method	–	–
rs11915402	3	58,957,115	G	0.995, 0.956	8.43 E-13	C3orf67	FGT, 1.51 E-07, 20	58754521, 59072633, 318,113	–
rs11915402	3	58,957,115	G	0.995, 0.956	8.43 E-13	C3orf67	SSLD, 2.51 E-11, 9	58957115, 59057595, 100,481	–
rs512244	4	12,775,151	G	0.195, 0.125	3.7 E-09	HS3ST1/HSP90AB2P	Not detected by any method	–	[22], [31]
rs17604670	4	113,564,881	G	0.966, 0.923	3.84 E-08	TIFA	Not detected by any method	–	–
rs2278600	5	71,792,426	G	0.930, 0.865	3.22 E-10	ZNF366	Not detected by any method	–	–
rs6596147	5	133,075,674	G	0.820, 0.738	1.77 E-09	FSTL4/C5orf15	FGT, 3.51 E-06, 9	133065358, 133094704, 29,347	[32], [33], [34], [35]
							CIT, 2.95 E-06, 9	133057095, 133094704, 37,610
							SSLD, 2.1 E-07, 6	133075674, 133094129, 18,456
rs2306848	7	129,556,365	G	0.990, 0.948	5.95 E-12	CPA4	Not detected by any method	–	–
rs1830035	7	63,170,795	A	0.996, 0.963	1.47 E-11	ZNF679	SSLD, 3.6 E-11, 4	63138417, 63170795, 32,379	–
rs10275421	7	100,536,496	G	0.991, 0.960	8.12 E-09	FIS1/RABL5	SSLD, 7.17 E-08, 2	100522057, 100536496, 14,440	–
rs11785995	8	131,021,293	G	0.982, 0.938	2.18 E-10	FAM49B	Not detected by any method	–	–
rs9785133	8	20,402,898	G	0.916, 0.860	3.9 E-08	LZTS1/LOC286114	FGT, 1.21 E-07, 6	20385189, 20404428, 19,240	[34]
rs872863	9	123,233,908	G	0.993, 0.940	2.25 E-16	DENND1A	Not detected by any method	–	[36]
rs7854383	9	81,666,969	G	0.959, 0.906	1.42 E-09	TLE1/FAM75D5	FGT, 1.69 E-08, 2	81666969, 81670581, 3613	[37]
							CIT, 1.08 E-07, 2	81662684, 81666969, 4286
							SSLD, 1.21 E-07, 3	81662684, 81670581, 7898
rs2900180	9	120,785,936	A	0.390, 0.303	6.24 E-09	TRAF1/C5	FGT, 4.66 E-08, 14	120720054, 120810962, 90,909	[26], [34], [36], [38], [39], [40], [41], [42], [43], [44]
							CIT, 8.03 E-08, 8	120720054, 120807548, 87,495
							SSLD, 4.5 E-08, 12	120720054, 120807548, 87,495
rs3761847	9	120,769,793	G	0.468, 0.380	1.24 E-08	TRAF1	FGT, 4.66 E-08, 14	120720054, 120810962, 90,909	[26], [34], [40], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51]
							CIT, 8.03 E-08, 8	120720054, 120807548, 87,495
							SSLD, 4.5 E-08, 12	120720054, 120807548, 87,495
rs881375	9	120,732,452	A	0.388, 0.304	2.27 E-08	PHF19/TRAF1	FGT, 4.66 E-08, 14	120720054, 120810962, 90,909	[34], [36], [49], [52], [43], [53], [54]
							CIT, 8.03 E-08, 8	120720054, 120807548, 87,495
							SSLD, 4.5 E-08, 12	120720054, 120807548, 87,495
rs1953126	9	120,720,054	A	0.387, 0.304	2.76 E-08	PHF19	FGT, 4.66 E-08, 14	120720054, 120810962, 90,909	[34], [36], [43], [44], [48], [53], [54]
							CIT, 8.03 E-08, 8	120720054, 120807548, 87,495
							SSLD, 4.5 E-08, 12	120720054, 120807548, 87,495
rs10760130	9	120,781,544	G	0.475, 0.389	3.78 E-08	TRAF1/C5	FGT, 4.66 E-08, 14	120720054, 120810962, 90,909	[34], [36], [39], [40], [43], [44], [49], [53], [54], [55]
							CIT, 8.03 E-08, 8	120720054, 120807548, 87,495
							SSLD, 4.5 E-08, 12	120720054, 120807548, 87,495
rs4918037	10	105,403,030	G	0.958, 0.897	6.12 E-11	SH3PXD2A	Not detected by any method	–	–
rs2671692	10	49,767,825	A	0.677, 0.592	2.66 E-08	WDFY4	SSLD, 4.84 E-08, 6	49767825, 49777543, 9719	[34], [35], [51], [53]
rs10999147	10	71,550,864	A	0.976, 0.939	4.16 E-08	AIFM2	FGT, 1.91 E-06, 2	71550196, 71550864, 669	–
rs4760609	12	46,702,024	C	0.907, 0.819	3 E-12	COL2A1/SENP1	FGT, 1.23 E-07, 3	46700325, 46703575, 3251	–
rs757123	12	119,263,543	G	0.943, 0.888	1.72 E-08	MSI1	Not detected by any method	–	–
rs4264325	14	104,050,531	G	0.997, 0.973	1.94 E-08	KIF26A/C14orf180	FGT, 5.69 E-06, 8	104045894, 104062173, 16,280	–
rs2292327	16	82,588,153	G	0.516, 0.405	1.16 E-09	NECAB2	Not detected by any method	–	–
rs2745106	16	1,481,462	G	0.954, 0.904	1.77 E-08	PTX4/TELO2	Not detected by any method	–	–
rs11868709	17	73,740,166	C	0.817, 0.714	7.38 E-11	TMEM235	Not detected by any method	–	–
rs8087252	18	44,295,753	G	0.924, 0.865	7.13 E-09	ZBTB7C/CTIF	Not detected by any method	–	–
rs6018432	20	35,485,260	G	0.956, 0.888	3.55 E-13	SRC/BLCAP	Not detected by any method	–	[56]
rs1182531	20	57,826,397	C	0.852, 0.779	6.53 E-09	PHACTR3	FGT, 1 E-08, 2	57826397, 57832814, 6418	[22], [31], [34], [35], [57]
rs1182531	20	57,826,397	C	0.852, 0.779	6.53 E-09	PHACTR3	SSLD, 1 E-08, 2	57826397, 57832814, 6418	[22], [31], [34], [35], [57]
rs13054355	22	20,321,624	G	0.930, 0.854	6.04 E-12	SDF2L1	FGT, 5.08 E-08, 7	20264229, 20321624, 57,396	–
							CIT, 1.09 E-08, 3	20313153, 20321624, 8472
							SSLD, 1.09 E-06, 3	20321624, 20346559, 24,936
rs1005133	22	18,112,909	G	0.844, 0.767	4.08 E-08	SEPT5-GP1BB/TBX1	FGT, 1.02 E-05, 2	18112175, 18112909, 735	–
rs1005133	22	18,112,909	G	0.844, 0.767	4.08 E-08	SEPT5-GP1BB/TBX1	CIT, 1.02 E-05, 2	18112175, 18112909, 735	–

(a) Assoc. Allele: Associated Allele.

(b) AAF: Associated Allele Frequency.

(c) P-values are calculated based on the chi-squared test.
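To illustrate how a chi-squared P-value of the kind listed in Table 3 arises from allele frequencies, the sketch below runs a 2 × 2 allelic test (1 degree of freedom, no continuity correction) for rs2476601. Several assumptions apply: the test form is assumed to be the allelic test, the allele counts are rebuilt from the rounded frequencies in Table 3 and the 868 cases and 1194 controls, and missing genotypes are ignored. An exact match with the tabulated value is therefore not expected.

```python
import math


def allelic_chi2(case_freq, control_freq, n_cases, n_controls):
    """2 x 2 allele-count chi-squared test; returns (chi2, P-value)."""
    case_alleles = 2 * n_cases        # two alleles per individual
    control_alleles = 2 * n_controls
    a = case_freq * case_alleles      # associated allele in cases
    b = case_alleles - a              # other allele in cases
    c = control_freq * control_alleles
    d = control_alleles - c
    n = case_alleles + control_alleles
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# rs2476601 (PTPN22): AAF 0.155 in cases and 0.084 in controls (Table 3).
chi2, p_value = allelic_chi2(0.155, 0.084, n_cases=868, n_controls=1194)
```

With these inputs the statistic is about 50 and the P-value is on the order of 10⁻¹², the same order of magnitude as the 1.12 × 10⁻¹² listed for this SNP.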

In Table 3, the block sizes (in bp) determined using the SSLD and CIT methods for the five biomarkers detected in the PHF19-TRAF1-C5 region were the same. However, the SSLD block included more associated SNPs (12) than the CIT block (8), as depicted in Fig. 6. Further investigation of this block showed that the four SNPs excluded by the CIT method had MAFs below 0.05 (a default condition in Haploview for the CIT method).

Fig. 6

Comparison of the CIT and SSLD methods on the same significant haplotype block in the PHF19-TRAF1-C5 region. (a) LD plot showing the CIT block comprising eight biomarkers. (b) LD plot for the SSLD block including twelve biomarkers.

For the non-Chr 6 biomarkers shown in Table 3, these results were in line with those obtained by Eyre et al. [27], who verified the association of PTPN22 (rs2476601, P-value = 1.12 × 10⁻¹²) with RA for populations of European ancestry. Moreover, these two studies confirm the association of TRAF1 with RA, but for different SNPs. The detected biomarker in the present study was rs3761847 (P-value = 1.24 × 10⁻⁸), while rs10739580 (P-value = 1.7 × 10⁻⁶) was identified by Eyre et al. These two biomarkers are 163,211 bp apart from each other. The genes of the “never been reported” biomarkers in Table 3 were examined more closely. Table 4 was constructed using DAVID 6.8 to relate these genes to RA pathology and to link gene-disease associations. Ten genes were found to play a role in RA pathology.

Table 4

Disease enrichment analysis for the genes of the “never been reported” biomarkers.

Gene name	Region	Functional pathway related to RA	Diseases affected by the gene
CDC42EP3	2p21	Induces pseudopodia formation in fibroblasts	Schizophrenia [59]
FAM82A1	2p22.2		Lung cancer [60]
PLCL1	2q33	Affects the bone density and the level of osteocalcin	Osteoporosis, hip bone size variation in females [61], intracranial aneurysm [62]
SATB2	2q33	Affects the activity of osteoblasts and the differentiation of immunocytes, plays a role in immune regulation, and elevations in the level of alkaline phosphatase	Cleft palate [63], [64], microdeletion syndrome [65], head and neck squamous cell carcinoma [66], colorectal carcinoma [67], laryngeal carcinoma [68], osteosarcoma [69], pancreatic cancer [70], esophageal carcinoma [71], hepatocellular carcinoma [72], HIV/AIDS infection [73], renal cell carcinoma [74], neuroendocrine tumors [75]
C3orf67	3p14.2
TIFA	4q25	Plays a role in the activation of IL-1, TRAF6, and IKK, affects the activation of NF-kappa-B
ZNF366	5q13.2	Plays a role in regulating the expression of genes in response to estrogen, affects the differentiation of dendritic cells and the production of IL-4, IL-10, IL-12, and NF-kappa-B	Osteoporosis [76], breast cancer [77], prostate cancer [78]
CPA4	7q32		Benign hypertrophic prostate, prostate cancer [79]
ZNF679	7q11.21
FIS1	7q22.1		Alzheimer's disease [80], leukemia [81], thyroid tumors [82]
RABL5	7q22.1
FAM49B	8q24.21		Endometriosis [83]
SH3PXD2A	10q24.33	Affects the activity of osteoclast	Breast cancer, melanoma [84], glioma [85], pre-eclampsia [86], lung adenocarcinoma [87], prostate cancer [88], colon cancer [89]
AIFM2	10q22.1		Ovarian cancer, retinoblastoma [90]
COL2A1	12q13.11	Plays a role in the activation of IL-6, Osteoarthritis, chondrodysplasia, epiphyseal dysplasia, joint deformity, spondyloepiphyseal dysplasia	Stickler and Wagner syndromes [91], chondrosarcomas [92], osteonecrosis of the femoral head [93], pathological myopia [94], congenital toxoplasmosis [95], Czech dysplasia [96], Legg-Calvé-Perthes [97]
SENP1	12q13.1	Plays a role in the activation of IL-6	Prostate cancer [98], leukemia, hepatoma [99]
MSI1	12q24.1-q24.31		Liver cancer, hepatoma, glioma and melanoma [100], neurodegenerative disorders [101], Helicobacter pylori infection [102], cervical carcinoma [103], endometriosis and endometrial carcinoma [104], medulloblastoma [105]
KIF26A	14q32.33
C14orf180	14q32.33
NECAB2	16q23.3
PTX4	16p13.3
TELO2	16p13.3		Glioma [106], intellectual disability [107], You-Hoover-Fong syndrome [108]
TMEM235	17q25.3		Cataract [109]
ZBTB7C	18q21.1		Sepsis [110], kidney cancer [111], cerebral ischemia [112]
CTIF	18q21.1		Hearing function [113]
SDF2L1	22q11.21		Insulinoma [114]
SEPT5	22q11.21	Involved in cytokinesis	Juvenile parkinsonism [115], pancreatic neoplasm [116], vitreoretinopathy [117], Parkinson's disease [118]
GP1BB	22q11.21-q11.23		Bernard-Soulier syndrome [119], Velocardiofacial syndrome [120], developmental delay, cardiac defects, dysmorphic facial features, palatal anomalies, hypocalcemia, and immune deficiency [121]
TBX1	22q11.21	Expands T lymphocytes activity, affects the activity of fibroblastic growth factor	DiGeorge syndrome, pharyngeal and aortic arch defects [122], Velocardiofacial syndrome [123], psychiatric disorders [124], lung tumor [125], Tetralogy of Fallot [126], Conotruncal heart defects [127], ventricular septal defect [128], renal malformations [129], adenoid cystic carcinoma [130], cleft palate [131], indirect inguinal hernia [132], prostate cancer [133]

As shown in Table 4, TBX1 played a role in RA pathology through its immunological function. A study by Meziani et al. confirmed the association of TBX1 (rs4819522, P-value = 0.0014) with RA in both Japanese and Europeans using a meta-analysis [58]. The identified SNP in the present study (rs1005133, P-value = 4.08 × 10⁻⁸) was in close proximity (28,427 bp) to the SNP obtained by Meziani et al. As shown in Table 3, rs1005133 was in a block with another SNP (rs5993820) detected by the CIT and FGT methods. An LD plot was generated for the region containing these two SNPs to uncover other associations in that region of Chr 22. As depicted in Fig. 7, rs4819522 was in strong LD neither with rs1005133 (D′ = 0.2, r² = 0.035) nor with rs5993820 (D′ = 0.411, r² = 0.021).

Fig. 7

LD plot for the TBX1 region showing a biomarker in this study (rs1005133) and a previously detected biomarker (rs4819522).
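The D′ and r² values quoted for the TBX1 region are the two standard pairwise LD measures. A minimal sketch of their textbook definitions for two biallelic loci follows; the input frequencies are illustrative and are not taken from the NARAC data, and the function name is hypothetical.

```python
def ld_measures(p_ab: float, p_a: float, p_b: float):
    """Return (D, D', r^2) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at locus 1 and
          allele B at locus 2
    p_a:  frequency of allele A at locus 1
    p_b:  frequency of allele B at locus 2
    """
    d = p_ab - p_a * p_b
    # D' scales |D| by its maximum possible value given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


# Complete LD with matching allele frequencies: D' = 1 and r^2 = 1.
complete = ld_measures(0.5, 0.5, 0.5)

# Independent loci: D = 0, so D' = 0 and r^2 = 0.
independent = ld_measures(0.25, 0.5, 0.5)

# Unequal allele frequencies: D' can reach 1 while r^2 stays low.
unequal = ld_measures(0.1, 0.5, 0.1)
```

The third case shows why the two measures are reported together: D′ reflects whether any historical recombination is evident, while r² also depends on how closely the allele frequencies match.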

The block similarity for the three applied methods of haplotype block partitioning is shown in Table 5. The similarity measure represents the SNPs detected by both methods in question divided by the total SNPs detected by the two methods. The highest block similarity was between CIT and FGT (mean ± SD = 0.464 ± 0.286). The block similarity between FGT and SSLD (mean ± SD = 0.21 ± 0.216) was nearly equal to that between CIT and SSLD (mean ± SD = 0.205 ± 0.193). The significance of these similarities was checked using one-way ANOVA with a post hoc t-test. The significance level for the three methods after Bonferroni correction was 0.0167 (0.05/3). The difference between (FGT and SSLD) and (CIT and SSLD) was not statistically significant (P-value = 0.936). The differences between (CIT and FGT) and (CIT and SSLD) and between (FGT and SSLD) and (FGT and CIT) were statistically significant (P-values = 0.001 and 0.002, respectively).
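The similarity measure described above can be sketched as a ratio of set sizes. The sketch assumes that "the total SNPs detected by the two methods" means the union of the two SNP sets, which makes the measure the Jaccard index; if the denominator were instead the sum of the two set sizes, the values would be smaller. The SNP IDs are made up.

```python
def block_similarity(snps_a, snps_b) -> float:
    """Shared SNPs divided by all SNPs detected by either method."""
    a, b = set(snps_a), set(snps_b)
    union = a | b
    if not union:
        return 0.0  # neither method detected anything
    return len(a & b) / len(union)


# Hypothetical significant SNPs on one Chr for two methods.
cit_hits = {"s1", "s2", "s3"}
fgt_hits = {"s2", "s3", "s4", "s5"}
similarity = block_similarity(cit_hits, fgt_hits)  # 2 shared out of 5
```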

Table 5

Block similarity among the haplotype block methods for the twenty-two Chrs.

Chr no.	CIT vs FGT	FGT vs SSLD	SSLD vs CIT
1	88%	21%	23%
2	39%	0%	0%
3	34%	45%	20%
4	100%	0%	0%
5	40%	21%	30%
6	76%	74%	71%
7	9%	32%	6%
8	39%	30%	34%
9	49%	29%	25%
10	0%	0%	0%
11	53%	0%	0%
12	74%	18%	21%
13	71%	0%	0%
14	17%	36%	24%
15	39%	33%	23%
16	0%	0%	54%
17	52%	51%	35%
18	0%	0%	0%
19	64%	52%	43%
20	50%	18%	27%
21	75%	0%	11%
22	53%	2%	4%
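As a cross-check, the mean ± SD values quoted in the text can be recomputed from the rounded percentages in Table 5. The sketch below uses only Python's standard library and assumes the sample standard deviation (n − 1 denominator); because the table entries are rounded, small discrepancies from the published values are to be expected.

```python
from statistics import mean, stdev

# Percentages from Table 5, one entry per chromosome (Chr 1 to Chr 22).
table5 = {
    "CIT vs FGT": [88, 39, 34, 100, 40, 76, 9, 39, 49, 0, 53,
                   74, 71, 17, 39, 0, 52, 0, 64, 50, 75, 53],
    "FGT vs SSLD": [21, 0, 45, 0, 21, 74, 32, 30, 29, 0, 0,
                    18, 0, 36, 33, 0, 51, 0, 52, 18, 0, 2],
    "SSLD vs CIT": [23, 0, 20, 0, 30, 71, 6, 34, 25, 0, 0,
                    21, 0, 24, 23, 54, 35, 0, 43, 27, 11, 4],
}

# Mean and sample SD (n - 1 denominator) for each pair of methods, in percent.
summary = {pair: (mean(vals), stdev(vals)) for pair, vals in table5.items()}

for pair, (m, sd) in summary.items():
    print(f"{pair}: {m / 100:.3f} ± {sd / 100:.3f}")
```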

As shown in Table 6, the SSLD method provided the best coverage of the hits obtained with the individual SNP approach, capturing 444 of the 541 SNPs. The FGT method captured 432 SNPs, and the CIT method captured 415 SNPs. However, after excluding the hits on Chr 6, the FGT method performed best, capturing 45 of 109 SNPs, and the CIT method (34 SNPs) performed better than the SSLD method (29 SNPs). The significance of the differences in coverage among the three block methods was checked using one-way ANOVA with a post hoc t-test. The mean ± SD numbers of hits per chromosome for the CIT, FGT, and SSLD methods were 18.864 ± 80.909, 19.636 ± 82.071, and 20.182 ± 88.199, respectively. The significance level for the three methods after Bonferroni correction was 0.0167 (0.05/3). The difference among the three groups determined using ANOVA was not statistically significant (P-value = 0.999).

Table 6

The ability of each haplotype block method to capture the significant SNPs determined with the individual SNP approach.

Chr no.	Individual SNP	CIT	FGT	SSLD
1	4	1	1	1
2	2	0	0	0
3	5	1	2	1
4	5	3	3	0
5	6	2	2	2
6	432	381	387	415
7	7	0	2	3
8	11	6	6	2
9	11	7	7	7
10	5	0	1	2
11	2	1	1	0
12	3	0	1	1
13	0	0	0	0
14	5	2	2	1
15	3	0	0	0
16	7	0	1	1
17	4	1	2	1
18	3	0	2	0
19	5	2	2	3
20	8	1	3	3
21	7	5	4	0
22	6	2	3	1
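The coverage totals and the ANOVA comparison discussed in the text can be reproduced from the columns of Table 6. The following is a sketch only: it computes the column totals with and without Chr 6 and a hand-rolled one-way ANOVA F statistic using the standard library. The helper name `one_way_anova_f` is hypothetical, and the P-value is not computed here because that would require the F distribution from a statistics package.

```python
from statistics import mean

# Per-chromosome counts from Table 6 (Chr 1 to Chr 22).
coverage = {
    "CIT": [1, 0, 1, 3, 2, 381, 0, 6, 7, 0, 1, 0, 0, 2, 0, 0, 1, 0, 2, 1, 5, 2],
    "FGT": [1, 0, 2, 3, 2, 387, 2, 6, 7, 1, 1, 1, 0, 2, 0, 1, 2, 2, 2, 3, 4, 3],
    "SSLD": [1, 0, 1, 0, 2, 415, 3, 2, 7, 2, 0, 1, 0, 1, 0, 1, 1, 0, 3, 3, 0, 1],
}
CHR6 = 5  # zero-based index of Chr 6 in each list

totals = {m: sum(v) for m, v in coverage.items()}
totals_no_chr6 = {m: sum(v) - v[CHR6] for m, v in coverage.items()}


def one_way_anova_f(groups):
    """F statistic: between-group mean square over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


f_stat = one_way_anova_f(list(coverage.values()))
print("Totals:", totals)
print("Totals without Chr 6:", totals_no_chr6)
print(f"One-way ANOVA F = {f_stat:.5f}")
```

The very small F statistic reflects how the Chr 6 counts dominate the within-method variance, which is consistent with the non-significant ANOVA result reported in the text.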

Most of the haplotype blocks that showed a strong association with RA were in or near (+3 Mb) the major histocompatibility complex (MHC) region. Most of the 1021 SNPs detected by the three block methods were in the MHC region. These outcomes confirmed the firm association between the MHC region and RA susceptibility. Some associated SNPs were detected by all the methods, but others were detected by only one method. These differences could be due to several reasons. For the associations observed using only the individual SNP approach, it may be that only one SNP is in strong LD with the causal SNP. Therefore, studying haplotypes could decrease the power of association because they consist of several SNPs. For the associations observed using only the haplotype block methods, the individual SNP approach required approximately 81.71% more tests than the block methods. Consequently, the Bonferroni correction was more severe for the individual SNP approach. The block methods were able to detect the interactions among many causal SNPs. In addition, haplotypes could capture rare alleles that may not be reflected by individual SNPs. The reason for this difference could be that the power to observe associations is maximized when the frequencies of the studied biomarker and the causal SNP are similar. Some associations were observed using one haplotype block method but not the others because the methods differ greatly in how they define a haplotype block.
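The effect of the test count on the Bonferroni threshold can be illustrated with a short sketch. The number of block tests used below is a hypothetical placeholder, not a figure from the study, and the sketch assumes that "81.71% more tests" means the individual SNP approach performs 1.8171 times as many tests as a block method.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


ALPHA = 0.05
n_block_tests = 100_000  # hypothetical placeholder count
n_snp_tests = round(n_block_tests * 1.8171)  # assumed reading of "81.71% more"

block_level = bonferroni_threshold(ALPHA, n_block_tests)
snp_level = bonferroni_threshold(ALPHA, n_snp_tests)

print(f"Block methods:  P < {block_level:.3e}")
print(f"Individual SNP: P < {snp_level:.3e}")
# The same rule gives the 0.05/3 level used for the three-way comparisons.
print(f"Three comparisons: {bonferroni_threshold(ALPHA, 3):.4f}")
```

Under these assumptions, a SNP must reach a P-value nearly twice as small to pass the individual SNP threshold as to pass the block threshold.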
The limitations of this study are as follows: (a) the effects of population stratification were not accounted for; (b) a replication study in other datasets was not performed; and (c) other haplotype block methods, such as those based on hidden Markov models [134], [135], dynamic programming-based algorithms [136], [137], [138], [139], [140], wavelet decomposition [141], greedy algorithms [142], the minimum description length principle [143], [144], spatial correlation of SNPs [145], sequence kernel association tests [146], and block entropy [147], were not included.

Conclusions

Applying the individual SNP approach together with the three block methods to the NARAC dataset maximizes the ability to discover crucial associations. If a single method must be selected, SSLD would be the most appropriate for the NARAC dataset. The SSLD method has valuable advantages, such as the highest genomic coverage; the largest minimum, median, and maximum significant block sizes; the highest number of significant SNPs included in blocks; and the highest number of associated SNPs discovered exclusively by a single method. In total, 355 SNPs showed a P-value lower than the genome-wide significance threshold. Among them (after excluding the Chr 6 results, 320 SNPs), 20 SNPs corresponding to 29 genes had not previously been reported for RA susceptibility. According to the literature, 10 of these 29 genes, namely, CDC42EP3, PLCL1, SATB2, TIFA, ZNF366, SH3PXD2A, COL2A1, SENP1, SEPT5, and TBX1, play a role in RA pathogenesis. As a future perspective, a replication study should be conducted to confirm the GWAS findings.

Conflict of interest

The authors have declared no conflict of interest.

Compliance with Ethics Requirements

This article does not contain any studies with human or animal subjects.


