Use of >100,000 NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium whole genome sequences improves imputation quality and detection of rare variant associations in admixed African and Hispanic/Latino populations.

Madeline H Kowalski¹, Huijun Qian², Ziyi Hou³, Jonathan D Rosen¹, Amanda L Tapia¹, Yue Shan¹, Deepti Jain⁴, Maria Argos⁵, Donna K Arnett⁶, Christy Avery⁷, Kathleen C Barnes⁸, Lewis C Becker⁹, Stephanie A Bien¹⁰, Joshua C Bis¹¹, John Blangero¹², Eric Boerwinkle^13,14, Donald W Bowden¹⁵, Steve Buyske¹⁶, Jianwen Cai¹⁷, Michael H Cho^18,19, Seung Hoan Choi²⁰, Hélène Choquet²¹, L Adrienne Cupples^22,23, Mary Cushman²⁴, Michelle Daya⁸, Paul S de Vries¹⁴, Patrick T Ellinor^20,25, Nauder Faraday⁹, Myriam Fornage²⁶, Stacey Gabriel²⁷, Santhi K Ganesh^28,29, Misa Graff⁷, Namrata Gupta²⁷, Jiang He³⁰, Susan R Heckbert^31,32, Bertha Hidalgo³³, Chani J Hodonsky⁷, Marguerite R Irvin³³, Andrew D Johnson^23,34, Eric Jorgenson²¹, Robert Kaplan³⁵, Sharon L R Kardia³⁶, Tanika N Kelly³⁰, Charles Kooperberg¹⁰, Jessica A Lasky-Su^18,19, Ruth J F Loos^37,38, Steven A Lubitz^20,25, Rasika A Mathias⁹, Caitlin P McHugh⁴, Courtney Montgomery³⁹, Jee-Young Moon³⁵, Alanna C Morrison¹⁴, Nicholette D Palmer¹⁵, Nathan Pankratz⁴⁰, George J Papanicolaou⁴¹, Juan M Peralta¹², Patricia A Peyser³⁶, Stephen S Rich⁴², Jerome I Rotter⁴³, Edwin K Silverman^18,19, Jennifer A Smith³⁶, Nicholas L Smith^31,32,44, Kent D Taylor⁴³, Timothy A Thornton⁴, Hemant K Tiwari⁴⁵, Russell P Tracy⁴⁶, Tao Wang⁴⁷, Scott T Weiss^18,19, Lu-Chen Weng²⁰, Kerri L Wiggins¹¹, James G Wilson⁴⁸, Lisa R Yanek⁹, Sebastian Zöllner^49,50, Kari E North^7,51, Paul L Auer⁵², Laura M Raffield⁵³, Alexander P Reiner³¹, Yun Li^1,53,54.

Abstract

Most genome-wide association and fine-mapping studies to date have been conducted in individuals of European descent, and genetic studies of populations of Hispanic/Latino and African ancestry are limited. In addition, these populations have more complex linkage disequilibrium structure. In order to better define the genetic architecture of these understudied populations, we leveraged >100,000 phased sequences available from deep-coverage whole genome sequencing through the multi-ethnic NHLBI Trans-Omics for Precision Medicine (TOPMed) program to impute genotypes into admixed African and Hispanic/Latino samples with genome-wide genotyping array data. We demonstrated that using TOPMed sequencing data as the imputation reference panel improves genotype imputation quality in these populations, which subsequently enhanced gene-mapping power for complex traits. For rare variants with minor allele frequency (MAF) < 0.5%, we observed a 2.3- to 6.1-fold increase in the number of well-imputed variants, with 11-34% improvement in average imputation quality, compared to the state-of-the-art 1000 Genomes Project Phase 3 and Haplotype Reference Consortium reference panels. Impressively, even for extremely rare variants with minor allele count <10 (including singletons) in the imputation target samples, average information content rescued was >86%. Subsequent association analyses of TOPMed reference panel-imputed genotype data with hematological traits (hemoglobin (HGB), hematocrit (HCT), and white blood cell count (WBC)) in ~21,600 African-ancestry and ~21,700 Hispanic/Latino individuals identified associations with two rare variants in the HBB gene (rs33930165 with higher WBC [p = 8.8×10⁻¹⁵] in African populations, rs11549407 with lower HGB [p = 1.5×10⁻¹²] and HCT [p = 8.8×10⁻¹⁰] in Hispanics/Latinos). 
By comparison, neither variant would have been genome-wide significant if either 1000 Genomes Project Phase 3 or Haplotype Reference Consortium reference panels had been used for imputation. Our findings highlight the utility of the TOPMed imputation reference panel for identification of novel rare variant associations not previously detected in similarly sized genome-wide studies of under-represented African and Hispanic/Latino populations.

Substances: beta-Globins

Year: 2019 PMID: 31869403 PMCID: PMC6953885 DOI: 10.1371/journal.pgen.1008500

Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 6.020

Introduction

Genotype imputation, despite being a standard practice in modern genetic association studies, remains challenging in populations of Hispanic/Latino or African ancestry, particularly for rare variants [1-6]. One obstacle lies in the lack of appropriate whole genome sequence reference panels for these admixed populations. For individuals of European descent, the relevant haplotypes available have increased by more than 500 times from 120 phased sequences in HapMap2 [7] to more than 64,000 phased sequences in Haplotype Reference Consortium (HRC) [8] reference. However, HRC is predominantly European (other than included 1000 Genomes Project Phase 3 (1000G) SNPs) and includes mostly low-coverage sequencing data (4-8x coverage). The state-of-the-art reference panels for African-ancestry (AA) and Hispanic/Latino cohorts, including the 1000 Genomes Project Phase 3 (1000G) [9] and the Consortium on Asthma among African-ancestry Populations in the Americas (CAAPA) [10], are at least one order of magnitude smaller than HRC. This is especially problematic given the complex LD structure in admixed populations. The NHLBI Trans-Omics for Precision Medicine (TOPMed) Project has recently generated deep-coverage (mean depth 30x) whole genome sequencing (WGS) on more than 50,000 individuals from >26 cohorts and from diverse ancestral backgrounds (notably including ~26% AA and ~10% Hispanic/Latino participants), and now provides an unprecedented opportunity for substantially enhancing imputation quality in under-represented admixed populations and subsequently boosting power for mapping genes and regions underlying complex traits. Here we demonstrate the improvements in rare variant imputation quality in AA and Hispanic/Latino populations using TOPMed as a reference panel versus 1000G and HRC panels, and subsequently identify two low-frequency/rare HBB variant associations with blood cell traits in AA and Hispanic/Latino samples using TOPMed-imputed genotyping array data.

Results and discussion

The cohort and ancestry composition of the TOPMed freeze 5b whole genome sequence reference panel used in our study and the samples with array-based genotyping used for imputation and hematological traits association analyses in self-identified AA and Hispanic/Latino individuals are summarized in S1 and S2 Tables, respectively. We first selected two large U.S. minority cohorts—one AA and one Hispanic/Latino—in order to comprehensively evaluate imputation quality: the Jackson Heart Study (JHS, all AA, n = 3,082) and the Hispanic Community Health Study/Study of Latinos (HCHS/SOL, all Hispanic/Latino, n = 11,887). Both the JHS and HCHS/SOL have external sources of dense genotype data available for comparison. JHS is the largest AA general population cohort sequenced in TOPMed freeze 5b. Therefore, we removed JHS samples from the TOPMed freeze 5b reference panel prior to performing imputation into JHS samples using SNPs genotyped on the Affymetrix 6.0 array, treating the TOPMed freeze 5b calls as true genotypes for evaluation of imputation quality in JHS. HCHS/SOL is the largest and most regionally diverse population-based cohort of Hispanic/Latino individuals living in the US. For HCHS/SOL, we used the entire set of 100,506 phased sequences from TOPMed freeze 5b (including JHS) as reference and performed imputation into 11,887 Hispanic/Latino samples genotyped on the Illumina Omni 2.5 SOL custom array (with high quality genotypes at 2,293,536 markers). As the external source of genotype validation in HCHS/SOL, we used genotypes from the Illumina MEGA array genotyping data (containing >1.7 million multi-ethnic global markers, including low frequency coding variants and ancestry-specific variants) available in the same HCHS/SOL samples to assess imputation quality, evaluating 688,189 imputed markers available on MEGA but not on Omni2.5. 
Compared with the 1000G Phase 3 reference panel [9], we were able to increase the number of well-imputed variants from ~28 and ~35 million to ~51 and ~58 million in JHS and HCHS/SOL, respectively (see S3 Table for genome-wide distribution of well-imputed variants). We defined well-imputed variants based on our previous work [1, 2, 4], using MAF-specific estimated R2 thresholds to ensure an average R2 ≥ 0.8 in each imputed cohort separately. For all rare variants with MAF < 0.5%, we observed ~4.2X (2.3X) and ~6.1X (3.3X) increases in the number of well-imputed variants in JHS (HCHS/SOL), compared with 1000G and HRC, respectively. We also observed 22% (11%) and 34% (20%) increases in imputation information content (as measured by average true R2, which is the squared Pearson correlation between imputed and true genotypes) (Fig 1 and S1 Fig, Table 1). For very rare variants with MAF <0.05%, we observed ~22.1X (5.8X) and ~11.8X (10.7X) increases in the number of well-imputed variants, with 6% (5%) and 13% (11%) increases in average true R2, in JHS (HCHS/SOL), compared with 1000G and HRC respectively. Mismatch rates between true and imputed genotypes were low; using the program CalcMatch, the mean concordance for heterozygote individuals (generally the hardest to impute) for Jackson Heart Study is 97.5% for all well imputed variants in Table 1, 96.6% for MAF <0.5%, and 97.6% for MAF < 0.05%. For HCHS/SOL, the mean concordance is 98.2% for all well imputed variants, 92.9% for MAF <0.5%, and 83.8% for MAF < 0.05%. Most well-imputed variants from 1000G and HRC were also included in TOPMed freeze 5b imputation results (S2 Fig).
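
The fold increases and R2 gains quoted for rare variants (MAF < 0.5%) can be reproduced from the counts reported in Table 1. The following is a minimal Python sketch of that arithmetic; the dictionary layout and function names are illustrative assumptions and are not part of the original analysis pipeline.

```python
# Table 1 values for rare variants (MAF < 0.5%):
# (number of well-imputed variants, avgTrueR2 in percent).
TABLE1_MAF_LT_0_5 = {
    "JHS": {
        "TOPMed5b": (33_355_468, 89.89),
        "1000G": (7_857_211, 73.60),
        "HRC": (5_488_848, 67.33),
    },
    "HCHS/SOL": {
        "TOPMed5b": (44_439_594, 89.99),
        "1000G": (19_192_645, 81.08),
        "HRC": (13_330_317, 75.19),
    },
}


def fold_increase(cohort: str, baseline: str) -> float:
    """Ratio of well-imputed variant counts, TOPMed5b over the baseline panel."""
    panels = TABLE1_MAF_LT_0_5[cohort]
    return panels["TOPMed5b"][0] / panels[baseline][0]


def relative_r2_gain(cohort: str, baseline: str) -> float:
    """Relative (not absolute) increase in avgTrueR2, as a percentage."""
    panels = TABLE1_MAF_LT_0_5[cohort]
    return 100.0 * (panels["TOPMed5b"][1] / panels[baseline][1] - 1.0)


if __name__ == "__main__":
    for cohort in TABLE1_MAF_LT_0_5:
        for baseline in ("1000G", "HRC"):
            print(f"{cohort} vs {baseline}: "
                  f"{fold_increase(cohort, baseline):.1f}X variants, "
                  f"{relative_r2_gain(cohort, baseline):.0f}% higher avgTrueR2")
```

Under this reading, the quoted percentage gains in average true R2 are relative improvements (for example, 89.89% versus 73.60% in JHS is a ratio of about 1.22), not differences in percentage points.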

Fig 1

Comparison of imputation reference panels, for variants with MAF < 1%.

Table 1

Number of well-imputed variants using TOPMed freeze 5b, 1000 Genomes Phase 3 (1000G) and Haplotype Reference Consortium (HRC).

Imputation reference panel	Variants in reference panel	Well-imputed, all MAF (JHS)	Well-imputed, all MAF (HCHS/SOL)	Well-imputed, MAF<0.5% (JHS)	avgTrueR², MAF<0.5% (JHS)	Well-imputed, MAF<0.5% (HCHS/SOL)	avgTrueR², MAF<0.5% (HCHS/SOL)	Well-imputed, MAF<0.05% (JHS)	avgTrueR², MAF<0.05% (JHS)	Well-imputed, MAF<0.05% (HCHS/SOL)	avgTrueR², MAF<0.05% (HCHS/SOL)
TOPMed5b	88,062,238	51,467,522	57,845,194	33,355,468	89.89%	44,439,594	89.99%	16,205,279	88.64%	28,230,718	75.21%
1000G	49,143,605	28,454,330	35,178,969	7,857,211	73.60%	19,192,645	81.08%	734,063	83.72%	4,901,159	71.39%
HRC	39,635,008	21,745,746	26,012,190	5,488,848	67.33%	13,330,317	75.19%	1,371,526	78.77%	2,637,393	67.78%

HCHS/SOL, Hispanic Community Health Study/Study of Latinos, JHS, Jackson Heart Study, MAF, minor allele frequency

The total number of well imputed variants is extrapolated from three selected 3 Mb regions: 16-19Mb region from chromosomes 3, 12, and 20. These regions were chosen arbitrarily across a range of chromosome sizes, avoiding centromere, telomere, and low-mappability regions. Imputation was carried out using all typed SNPs +/-1Mb (i.e., 15-20Mb) and quality was evaluated in the core 3Mb region. Post imputation quality control was carried out in seven MAF categories separately: < .05%, .05-.2%, .2-.5%, .5–1%, 1–3%, 3–5%, and >5%. In each MAF category, an estimated R2 threshold (standard imputation software metric calculated based on the ratio of observed variance in imputed dosages over expected variance based on allele frequencies) was selected to ensure variants above the threshold have an average estimated R2 of at least 0.8. These variants constitute the well imputed variants. For variants with a MAF<0.5% and <0.05%, respectively, we additionally assessed avgTrueR2, average true squared Pearson correlation between imputed genotypes and genotypes from available whole genome sequencing data (JHS) or genotyping array data (HCHS/SOL).
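
The quality metrics and the threshold rule described in this footnote can be sketched in a few lines of Python. This is a simplified illustration, not the imputation software's implementation: `estimated_r2` assumes diploid dosages and a Hardy-Weinberg expected variance of 2p(1-p), and `select_threshold` is a hypothetical helper that picks, within one MAF category, the lowest estimated R2 cutoff for which the retained variants still average at least 0.8.

```python
from statistics import mean, pvariance
from typing import Optional, Sequence


def estimated_r2(dosages: Sequence[float]) -> float:
    """Observed dosage variance over the expected variance 2p(1-p).

    Simplified diploid form; p is estimated from the mean dosage.
    """
    p = mean(dosages) / 2.0
    expected = 2.0 * p * (1.0 - p)
    return pvariance(dosages) / expected if expected > 0 else 0.0


def true_r2(imputed: Sequence[float], truth: Sequence[float]) -> float:
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    mx, my = mean(imputed), mean(truth)
    sxy = sum((x - mx) * (y - my) for x, y in zip(imputed, truth))
    sxx = sum((x - mx) ** 2 for x in imputed)
    syy = sum((y - my) ** 2 for y in truth)
    return (sxy * sxy) / (sxx * syy) if sxx > 0 and syy > 0 else 0.0


def select_threshold(est_r2_values: Sequence[float],
                     target: float = 0.8) -> Optional[float]:
    """Lowest estimated-R2 cutoff whose retained variants average >= target.

    Returns None if even the best variant falls below the target.
    Ties exactly at the cutoff are not treated specially.
    """
    ranked = sorted(est_r2_values, reverse=True)
    running_sum, threshold = 0.0, None
    for k, value in enumerate(ranked, start=1):
        running_sum += value
        if running_sum / k >= target:
            threshold = value  # keeping the top-k variants still meets the target
        else:
            break  # prefix means of a descending list can only decrease
    return threshold
```

Note that the selected cutoff can sit below 0.8: with estimated R2 values of 0.95, 0.9, 0.85, 0.6 and 0.3, the sketch returns 0.6, because the four variants at or above that cutoff average 0.825.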

Comparison of imputation reference panels, for variants with MAF < 1%.

Imputation quality (measured by true R2 [Y-axis]) is plotted with progressively more stringent post-imputation filtering from left to right, with filtering according to estimated R2 (X-axis), for variants with MAF < 1%. Top panels are for the JHS cohort and bottom panels for the HCHS/SOL cohort. Three reference panels are shown: TOPMed (TOPMed freeze 5b), 1000G (the 1000 Genomes Phase 3), and HRC (the Haplotype Reference Consortium).

Even for extremely rare variants with sample minor allele count (MAC) <10 (including cohort singleton variants in the target JHS cohort), average information content rescued (again measured by true R2) was >86%. 
For example, out of the 8.67 million singleton variants discovered in JHS by TOPMed WGS, 72% (6.24 million) can be well-imputed using Affymetrix 6.0 genotypes and using TOPMed freeze 5b (without JHS individuals) as reference, with an average true R2 of 0.92 (Table 2). Singletons within JHS are defined as variants with MAC = 1 among the JHS samples but which are present in multiple copies in the reference panel. Specifically, the average reference MAC is 29.3 before post-imputation quality control (QC) and 31.0 after QC, with all variants having a MAC>5 in the overall reference panel. Imputation quality is similarly high when examining extremely rare MAC variants in the reference panel, and even higher, as expected, with higher MAC variants within the JHS sample (S4 and S5 Tables). Similar observations hold true for HCHS/SOL, with slightly lower imputation quality (S6 and S7 Tables). Compared to JHS African Americans, the lower imputation quality in HCHS/SOL Hispanic/Latino individuals is likely attributable to multiple reasons, including (1) the more complex LD structure among Hispanic/Latino individuals due to the admixture of three ancestral populations; (2) the availability of a much smaller subset of rare variants for quality evaluation through MEGA array genotyping in HCHS/SOL (in contrast to the availability of nearly all segregating variants in JHS through high-coverage sequencing); and (3) the smaller number of relevant haplotypes in the TOPMed freeze 5b reference (~26% self-identified AAs compared to ~10% self-identified Hispanics/Latinos). Imputation quality for rare and low-frequency variants that are estimated to be well imputed in Table 1 is further stratified by regional background in HCHS/SOL and displayed in S8 Table. 
We note that greater numbers of AA and Hispanic/Latino individuals will be included in future releases of sequencing datasets from TOPMed, which we anticipate will further improve imputation quality; inclusion of JHS itself in imputation for other AA cohorts would also improve imputation quality.

Table 2

Imputation quality for rare variants (minor allele count ≤ 10) in the Jackson Heart Study (JHS).

JHS MAC	#Variants	#QC+	avgMAC	avgMAC_QC+	avgEstR²	avgTrueR²
1	8,673,112	6,236,211	29.3	31.0	86.9%	92.0%
2	5,488,071	4,502,844	37.2	39.0	89.0%	86.7%
3	3,865,676	3,304,749	46.7	48.4	90.3%	86.2%
4	2,786,048	2,425,855	59.1	60.9	91.1%	86.4%
5	2,058,252	1,809,190	73.7	75.8	91.6%	86.9%
6	1,570,124	1,377,280	91.0	93.9	92.0%	87.5%
7	1,223,738	1,088,972	110.3	112.4	92.3%	88.1%
8	992,012	890,572	127.3	129.0	92.5%	88.5%
9	836,222	753,584	145.7	147.4	92.8%	89.1%
10	713,541	643,909	163.4	165.0	93.0%	89.5%

MAC, minor allele count; #Variants, total number of variants with a given MAC in JHS which overlapped with the TOPMed freeze 5b reference panel; #QC+, number of these variants which passed imputation quality control; avgMAC, the average minor allele count in the (TOPMed freeze 5b minus JHS) reference panel of these variants; avgMAC_QC+, the average minor allele count in the (TOPMed freeze 5b minus JHS) reference panel of variants which passed imputation quality control; avgEstR2, average estimated R2 for imputed variants (standard imputation software metric calculated based on the ratio of observed variance in imputed dosages over expected variance based on allele frequencies); avgTrueR2, average true squared Pearson correlation between imputed genotypes and genotypes from available whole genome sequencing data. Variants that did not have a MAC>5 in the full TOPMed freeze 5b reference panel were not evaluated.

Encouraged by these substantial gains in information content for low-frequency and rare variants, we proceeded with imputation in several additional AA and Hispanic/Latino data sets with array-based genotyping (S1 and S9 Tables), followed by association analyses with quantitative blood cell traits to evaluate the power of TOPMed freeze 5b-based imputation in minorities for discovery of genetic variants underlying complex human traits. We specifically chose hematological traits for several reasons. First, these traits are important intermediate clinical phenotypes for a variety of cardiovascular, hematologic, oncologic, immunologic, and infectious diseases [11]. Second, these traits have family-based heritability estimates in the range of 40–65% [12, 13], and have been highly fruitful for gene-mapping with >2,700 common and rare variants identified, though primarily in individuals of European ancestry [14-19]. 
Third, these traits remain under-studied in admixed AA and Hispanic/Latino populations, despite evidence for the existence of variants with distinct genetic architecture in AAs and Hispanics/Latinos [20-22]. For example, while hundreds of variants identified in genome-wide association studies (GWAS) of WBC in individuals of European descent explain only ~7% of array heritability, the African specific Duffy null variant DARC rs2814778 alone accounts for 15–20% of population-level WBC variability in AAs [23]. Finally, we have previously successfully leveraged deep-coverage exome sequencing-based imputation using resources from the Exome Sequencing Project for more powerful mapping of genes and regions associated with hematological traits in AAs [1]. Hemoglobin level (HGB), hematocrit (HCT), and WBC were chosen for our primary phenotypic analysis because these traits are available in the largest sample size among the AA and Hispanics/Latinos included in our discovery cohorts. Our imputation sample used for discovery blood cell trait association analyses included eight cohorts (21,513 AAs and 21,689 Hispanics/Latinos) (S1 Table). These discovery samples do not overlap with individuals sequenced as part of TOPMed freeze 5b (S2 Table). We used the full set of 100,506 phased sequences from TOPMed freeze 5b (including JHS) as the imputation reference panel. We then carried out AA- and Hispanic/Latino-stratified association analyses with quantitative HGB, HCT, and total WBC separately in each cohort genotyping array data set, accounting for ancestry and relatedness. The genome-wide association results for each imputed cohort data set were then meta-analyzed within each ancestry group. S3–S8 Figs show the Manhattan plots from ethnic-specific meta-analyses for each trait. QQ plots (S9–S14 Figs) show no obvious early departure, with genomic control lambda ranging from 1.008 to 1.044, indicating minimal global inflation of test statistics. 
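
The text does not state how the genomic control lambda was computed. A minimal sketch, assuming the common definition (median observed 1-degree-of-freedom chi-square statistic divided by the median of the null chi-square distribution, about 0.455), is shown here in Python; the function name and the conversion from two-sided p-values are illustrative assumptions.

```python
from statistics import NormalDist, median
from typing import Iterable

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.4549).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_control_lambda(p_values: Iterable[float]) -> float:
    """Median observed 1-df chi-square statistic divided by its null median.

    Each two-sided p-value (0 < p <= 1) is converted back to a chi-square
    statistic via the squared normal quantile of p / 2.
    """
    chi2_stats = [_STD_NORMAL.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN
```

A value close to 1, as with the reported range of 1.008 to 1.044, means the bulk of the test statistics follows the null distribution.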
For replication of any novel associations identified in the imputation-based discovery analysis, we utilized WGS genotype data and hematological trait data from the non-overlapping set of AA individuals within TOPMed freeze 5b (S10 Table) (see Methods for details). We first evaluated association statistics for variants previously associated with HGB, HCT, or WBC count in AA and Hispanic/Latino populations (summarized in S11 Table). We assembled a list of 24 AA and 13 Hispanic/Latino previously identified autosomal signals from prior published GWAS or exome-based studies [1, 19, 20, 24–30]. Our lists excluded variants reported in multi-ethnic cohorts or meta-analyses including individuals of non-AA or non-Hispanic/Latino ancestry to guard against the scenario that the reported signals were driven predominantly by individuals of European or Asian ancestry. Among the previously reported 24 AA and 13 Hispanic/Latino variants, all but five (four SNPs and a 3.8 kb deletion variant esv2676630) passed variant quality-control filters in TOPMed freeze 5b and were subsequently well-imputed in our target AA and Hispanic/Latino data sets with a stringent post-imputation R2 filter of >0.8 (detailed in S12 Table). Among the 31 known HGB, HCT, or WBC count associations testable with TOPMed freeze 5b, our imputed/discovery cohorts confirmed 84% of these previously reported findings with a consistent direction of effect, using a stringent genome-wide significance threshold of p<5×10⁻⁸. Using more lenient p-value thresholds, we could replicate 94% (p<5×10⁻⁶) and 100% (p<0.05) of the previously reported findings with the same direction of effect. 
While these results help confirm the overall validity of our hematological trait association results, it is important to note for these comparisons that many of the samples included in the current TOPMed freeze 5b imputed genome-wide association analysis were also used in the publications originally reporting associations in AA and Hispanic/Latino individuals. Our ancestry-stratified imputation-based discovery meta-analysis revealed two blood cell trait associations that have not been previously reported, at a genome-wide significance threshold of p<5×10⁻⁹ in Hispanics/Latinos and p<1×10⁻⁹ in AA populations, based on appropriate significance thresholds for whole genome sequencing analysis [31]. One signal was revealed in each ancestry group: hemoglobin subunit beta (HBB) missense (p.Glu7Lys) variant rs33930165 (gb38:11:5227003:C:T) associated with increased WBC in AAs (β = 0.35 and p = 8.8×10⁻¹⁵, adjusting for SNP rs2814778 and removing potential minor allele homozygotes) (Table 3), and HBB stop-gain (p.Gln40Ter) variant rs11549407 (gb38:11:5226774:G:A) associated with lower HGB and HCT in Hispanics/Latinos (HGB β = -1.92, p = 1.5×10⁻¹²; HCT β = -1.66, p = 8.8×10⁻¹⁰). Both variants were either low frequency or rare: the HBB missense variant rs33930165 (hemoglobin C variant) has a MAF of 1.14% among the imputed AA discovery samples and is even rarer in non-AA individuals (absent in Europeans in 1000G); the stop gain variant rs11549407 has a MAF of 0.03% (MAC ~ 15) among the imputed Hispanics/Latinos and is monomorphic among the AAs. Both variants are classified as pathogenic in ClinVar. Both variants were well imputed with R2 ranging from 0.831 to 0.994 and 0.862 to 0.999 in the contributing AA and Hispanic/Latino cohorts, respectively (Table 3). 
Due to the low allele frequency of these variants in AAs and Hispanics/Latinos and even lower frequency in individuals of European descent, both variants were imputed with lower quality using other reference panels (S13 and S14 Tables): the missense variant HBB rs33930165 had R2 as low as 0.127 and 0.456 using 1000G and HRC, respectively, as references; the HBB stop-gain variant rs11549407 was not available in the 1000G reference panel and had R2 as low as 0.413 using HRC as the reference panel. Carrying the 1000G and HRC imputed genotypes forward to association analyses with hematological traits in the subset of our target imputation cohorts where the variants were well imputed (R2 > 0.8), we observed that none of the p-values reached the genome-wide significance threshold. This explains why these variants were not detectable at a genome-wide significant level using previously available imputation reference panels, with obvious implications for other complex trait association studies in ancestrally diverse study populations.
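
As a quick consistency check on the reported rarity of rs11549407, the relation between minor allele count and minor allele frequency for diploid autosomal genotypes, MAF = MAC / (2N), can be written out as below. This Python sketch assumes that the denominator is the full Hispanic/Latino discovery sample (N = 21,689); the per-trait sample sizes may differ, so the result is only approximate.

```python
def maf_from_mac(minor_allele_count: float, n_individuals: int) -> float:
    """Minor allele frequency for diploid autosomal genotypes: MAC / (2N)."""
    return minor_allele_count / (2 * n_individuals)


# Approximate MAC of 15 among an assumed 21,689 Hispanic/Latino individuals.
maf = maf_from_mac(15, 21_689)
print(f"MAF is roughly {100 * maf:.3f}%")
```

The result is consistent with the reported MAF of about 0.03%.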

Table 3

Novel variants detected in TOPMed freeze 5b imputed Hispanic/Latino and African ancestry cohorts, in association analyses with white blood cell count, hemoglobin, and hematocrit.

Ancestry	rs#	Estimated R² ¹	Phenotype	Effect allele	EAF	β	SE	P-value	Replication β	Replication P-value	Gene	Annotation
African ancestry	rs33930165	0.831–0.994	WBC	T	1.14%	0.35	0.04	8.8E-15	0.27	4.6E-04	HBB	missense (p.Glu7Lys)
Hispanic/Latino	rs11549407	0.862–1.000	HCT	A	0.03%	-1.66	0.27	8.8E-10	NA⁴		HBB	stop gain (p.Gln40Ter)
Hispanic/Latino	rs11549407	0.862–1.000	HGB	A	0.03%	-1.92	0.27	1.5E-12	NA⁴		HBB	stop gain (p.Gln40Ter)

EAF, effect allele frequency, HCT, hematocrit, HGB, hemoglobin, WBC, white blood cell count.

Imputation R2 (estimated R2) range reported across all included imputed cohorts.

Association results adjusted for nearby known SNPs whenever applicable.

Association models for rs33930165 were adjusted for SNP rs2814778, with potential minor allele homozygotes removed.

Association models for rs11549407 were adjusted for SNPs rs334, rs33930165, and rs2213169. rs334 and rs2213169 did not pass variant quality filters in TOPMed freeze 5b and were not included in our main analyses. However, to follow up our novel results in the HBB locus, we phased the failed variants in freeze 5b and performed targeted imputation using TOPMed freeze 5b calls for rs334 and rs2213169.

NA: among TOPMed freeze 5b Hispanic/Latino individuals, MAC = 1, so association statistics are not available.

Both of our previously unreported genotype-trait associations involve coding variants of HBB, which encodes the beta polypeptide chains in adult hemoglobin. The HBB stop gain (p.Gln40Ter) variant 11:5226774:G:A (rs11549407) is the most common cause of beta zero thalassemia in West Mediterranean countries, particularly among the founder population of Sardinia [32, 33], where the variant has a population allele frequency of ~5%. The Sardinian population is represented in the HRC reference panel (~3500 individuals), which likely contributes to the reasonable imputation quality observed using HRC in most but not all cohorts, in contrast to the absence of this variant in the 1000G reference panel due to very low minor allele count, though imputation quality was clearly improved with the TOPMed freeze 5b reference panel. 
The p.Gln40Ter mutation is much less prevalent outside of the Western Mediterranean, but has been detected among individuals with beta thalassemia among admixed populations from Central and South America [34, 35], which are geographically and genetically similar to some of the Hispanic/Latino samples included in our imputation-based discovery sample. While the individuals carrying the HBB p.Gln40Ter allele in our unselected population-based Hispanic/Latino sample were all imputed heterozygotes (consistent with “thalassemia minor” and generally considered healthy), there is increasing evidence that silent carriers of beta-thalassemia and sickle cell mutations may be at risk for various health-related conditions [36, 37]. Due to the relatively small number of Hispanic/Latino individuals with blood cell trait data in TOPMed freeze 5b (n~1,080), including only one heterozygote carrier of rs11549407 in those with blood cell traits measured, we were unable to perform a well-powered replication of the association of rs11549407 with HGB and HCT. However, moderate anemia is known to occur in some individuals with thalassemia minor, which is concordant with our results [38]. The association of the HBB missense (p.Glu7Lys) variant 11:5227003:C:T (rs33930165) with higher total WBC (β = 0.35, p = 8.8×10⁻¹⁵) among AAs was unexpected; rs33930165 has been associated with red blood cell indices such as mean corpuscular hemoglobin concentration [20] but not with white blood cell traits. Because of the higher allele frequency of this variant and also the larger number of AA samples (n = 6,743) in TOPMed freeze 5b, we were able to replicate this HBB rs33930165 association with total WBC in an independent sample (β = 0.27 and p = 4.6×10⁻⁴) of AA individuals. 
By contrast, there was no significant association of the HBB rs33930165 p.Glu7Lys variant with HGB and a modest association with lower HCT in the AA discovery and replication data sets (discovery HCT β = -0.122, p = 0.012; HGB β = 0.110, p = 0.022; replication HCT β = -0.239, p = 0.002; HGB β = -0.009, p = 0.909). The minor allele T of rs33930165 encodes an abnormal form of hemoglobin, Hb C, which in the homozygous state is associated with mild chronic hemolytic anemia and mild to moderate splenomegaly [39]. In our discovery and replication data sets, there were no individuals homozygous for the Hb C variant, nor any compound heterozygotes for Hb S/C (Hb S is sickling form of hemoglobin and individuals homozygous for Hb S have sickle cell disease), which excludes the possibility that the apparently higher WBC is driven by an “inflammatory response” confined to a small number of individuals clinically affected by sickle cell disease or hemoglobin C disease. We next evaluated the association of HBB rs33930165 with circulating number of WBC subtypes, including neutrophils, monocytes, lymphocytes, basophils, and eosinophils. S15 Table shows the results in our AA imputation-based discovery data sets (S16 Table), and TOPMed freeze 5b WGS replication samples (S17 Table), which suggest that the apparent association of HBB rs33930165 with total WBC is mainly driven by an association with higher lymphocyte count, with perhaps a more modest association with higher neutrophil count. Further studies are needed to delineate the putative mechanism of this unexpected association. Our findings showcase the power of the large, ancestrally diverse TOPMed WGS data set as an imputation reference panel for admixed populations, in terms of both imputation quality and accuracy (especially for rare variants) and subsequent association studies for complex traits. 
Specifically, we identified two rare variants associated with hematological traits in AA and Hispanic/Latino populations and were able to validate our initial HBB association with WBC in an independent replication sample of sequenced individuals. In our study, we used EAGLE and minimac4 for imputation. We anticipate that the advantages of TOPMed as a reference panel also manifest when using alternative imputation methods. However, making TOPMed available as a reference panel compatible with each imputation method (e.g., corresponding recombination rate information) would be essential. In addition, computing time and memory usage should be taken into consideration as not all existing methods can scale to ~100 million markers in populations containing over thousands of individuals. TOPMed freeze 5b imputation is slightly more computationally intensive than use of the HRC reference panel (and takes nearly eight times longer than 1000G based imputation using the Michigan imputation server). However, we feel this increase in computational time is more than justified by the large number of additional well-imputed variants. We would note that the gains in imputation quality for AA and Hispanic/Latino populations using the TOPMed WGS reference panel likely do not apply to populations poorly represented in TOPMed freeze 5b (such as South Asians); future large-scale sequencing, including in later freezes of TOPMed, will improve imputation quality further across global populations. Future studies should also evaluate potential increases in statistical power for gene- and region- based tests using TOPMed imputed data. To demonstrate the potential gains, we have performed a targeted analysis of genes previously identified for their association with white blood cell count or hemoglobin/hematocrit levels in exome genotyping arrays or exome sequencing studies. 
We compared gene-based SKAT test results at these known loci using TOPMed freeze 5b based imputation to gene-based tests performed using 1000G and HRC reference panels. These results are presented in S18–S21 Tables and demonstrate that in both African ancestry and Hispanic/Latino populations more previously implicated genes from exome arrays or sequencing based studies were significant using TOPMed freeze 5b as an imputation reference panel versus 1000G phase 3 or HRC imputation. Further exploration of gene- and region-based tests is warranted in future studies, however. We expect the combination of high-quality imputation and higher depth sequencing datasets in larger cohorts of individuals will provide increased power for all rare variant association analyses in diverse populations in the near future.

Methods

Ethics statement

We performed secondary data analysis on deidentified data only (exempt research). Access to TOPMed data was approved by the University of North Carolina at Chapel Hill Institutional Review Board (study 16–2213). All individual studies included in TOPMed were approved by relevant local ethical review boards.

TOPMed 5b sequencing and phasing

The reference panel used for imputation was obtained from deep-coverage whole genome sequences derived from NHLBI’s TOPMed program (www.nhlbiwgs.org), freeze 5b (September 2017). This release included 54,035 non-duplicated, dbGaP released samples, of whom 50,253 have consent to be part of an imputation reference panel. The parent studies that contributed these 50,253 samples are listed in S2 Table. Specific to our analyses, freeze 5b includes 3,082 individuals from the Jackson Heart Study, who were removed from the reference panel for our analysis of imputation quality in this particular cohort. Overall, freeze 5b included 54% European ancestry, 26% AA, 10% Hispanic/Latino, 7% Asian, and 3% other ancestry samples. Detailed sequencing methods used in TOPMed are available at https://www.nhlbiwgs.org/topmed-whole-genome-sequencing-project-freeze-5b-phases-1-and-2. In brief, WGS with mean genome coverage ≥30x was completed at six sequencing centers (New York Genome Center, the Broad Institute of MIT and Harvard, the University of Washington Northwest Genomics Center, Illumina Genomic Services, Macrogen Corp., and Baylor Human Genome Sequencing Center). Sequence data files were transferred from sequencing centers to the TOPMed Informatics Research Center (IRC), where reads were aligned to human genome build GRCh38, using a common pipeline, and joint genotype calling was undertaken. Variants were filtered using a machine learning based support vector machine (SVM) approach, using variants present on genotyping arrays as positive controls and variants with many Mendelian inconsistencies as negative controls. After filtering potentially problematic variant sites, freeze 5b contained ~438 million single nucleotide polymorphisms and ~33 million short insertion-deletion variants. For our imputation analyses, we excluded from the reference panel variants with an overall allele count of 5 or less (leaving 88,062,238 variants in our reference panel, Table 1). 
Additional sample level quality control (such as detection of sex mismatches, pedigree discrepancies, sample swaps, etc.) was undertaken by the TOPMed Data Coordinating Center (DCC).

Genome-wide genotyping array data sets used for evaluation of imputation quality and/or phenotype association analysis

Hispanic Community Health Study/Study of Latinos (HCHS/SOL)

The HCHS/SOL cohort began in 2006 as a prospective study of Hispanic/Latino populations in the U.S. [40-42]. From 2008 to 2011, 16,415 adults were recruited from a random sample of households in four communities (the Bronx, Chicago, Miami, and San Diego). Each Field Center recruited >4,000 participants from diverse socioeconomic groups. Most participants self-identified as having Cuban, Dominican, Puerto Rican, Mexican, Central American, or South American heritage. The cohort has been genotyped both using an Illumina Omni2.5M array (plus 150,000 custom SNPs, including ancestry-informative markers, Amerindian population specific variants, previously identified GWAS hits, and other candidate polymorphisms, for a total of 2,293,715 SNPs) [43] and using the Illumina Multi-Ethnic Genotyping Array (MEGA) (containing a total of 1,705,969 SNPs) in efforts from the Population Architecture for Genetic Epidemiology [44] consortium to better assess variation in non-European populations. The MEGA array also includes additional exonic, functional, and clinically relevant variants. Illumina Omni2.5M array genotypes were available for 12,802 samples, among whom 11,887 samples also had MEGA array genotypes. The Illumina Omni2.5M array was used for imputation to the TOPMed reference panel, with the MEGA array treated as true genotypes for evaluation of imputation quality. For association analysis, imputation was performed on 11,887 samples after merging Omni2.5M array genotypes and MEGA array genotypes (MEGA genotypes were used for variants in both arrays, which resulted in 2,144,214 variants after quality control). Regional background (for evaluation of stratified imputation quality in S8 Table) was defined using both self-identified background and genetic markers, as described in [43]. For the hematological traits association analysis, 11,588 Hispanic/Latino participants were included.
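The array-merging rule described above (MEGA genotypes take precedence for variants present on both arrays) can be illustrated with a minimal sketch. The function name `merge_arrays` and the dictionary representation of one sample's genotypes are assumptions made for illustration; the actual merge was performed on genome-wide array files and included quality control steps not shown here.

```python
from typing import Dict

# Hypothetical representation: variant ID -> genotype call for one sample.
Genotypes = Dict[str, str]

def merge_arrays(omni: Genotypes, mega: Genotypes) -> Genotypes:
    """Merge two arrays, preferring MEGA calls for variants on both arrays."""
    merged = dict(omni)   # start from all Omni2.5M calls
    merged.update(mega)   # MEGA overrides shared variants and adds MEGA-only ones
    return merged

if __name__ == "__main__":
    omni = {"rsA": "A/G", "rsB": "C/C"}
    mega = {"rsB": "C/T", "rsC": "T/T"}
    print(merge_arrays(omni, mega))
```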

Women’s Health Initiative

The Women’s Health Initiative (WHI) [45] is a long-term national health study focused on heart disease, cancer, and osteoporotic fractures in older women. WHI originally enrolled 161,808 women aged 50–79 between 1993 and 1998 at 40 centers across the US, including both a clinical trial arm (with three trials, for hormone therapy, dietary modification, and calcium/vitamin D) and an observational study arm. The recruitment goal of WHI was to include a socio-demographically diverse population with racial/ethnic minority groups proportionate to the total minority population of US women aged 50–79 years. This goal was achieved; a diverse population, including 26,045 (17%) women from minority populations, was recruited. Two WHI extension studies conducted additional follow-up on consenting women from 2005–2010 and 2010–2015. Genotyping was available on some WHI participants through the WHI SNP Health Association Resource (SHARe), which used the Affymetrix 6.0 array (~906,600 SNPs, 946,000 copy number variation probes), and on other participants through the MEGA array [44]. Imputation and association analyses were performed separately in individuals with Affymetrix only, MEGA only, and both Affymetrix and MEGA data (S1 Table). For variants with both Affymetrix and MEGA genotypes available, MEGA genotypes were used. In total, 4,318 Hispanic/Latino and 8,494 AA women with blood cell traits were included.

UK Biobank

UK Biobank [46] recruited 500,000 people aged 40–69 years in 2006–2010, establishing a prospective biobank study to understand the risk factors for common diseases such as cancer, heart disease, stroke, diabetes, and dementia. Participants are being followed up through routine medical and other health-related records from the UK National Health Service. UK Biobank has genotype data on all enrolled participants, as well as extensive baseline questionnaire and physical measures and stored blood and urine samples. Hematological traits were assayed as previously described [14]. Genotyping on custom Axiom arrays and subsequent quality control have been previously described [47]. Samples were included in our analyses if ancestry self-report was “Black Caribbean”, “Black African”, “Black or Black British”, “White and Black Caribbean”, “White and Black African”, or “Any Other Black Background”. Variants were selected based on call rate exceeding 95%, HWE p-value exceeding 10^-8, and MAF exceeding 0.5%. Subsequently, variants in approximate linkage equilibrium were used to generate ten principal components. Samples were excluded if the first principal component exceeded 0.1 and the second principal component exceeded 0.2, to exclude individuals not clustering with most African ancestry individuals. In total, 6,820 AA participants with blood cell traits were included in the analysis.
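The variant and sample filters described for UK Biobank can be expressed as simple predicates. The sketch below is illustrative only: the `VariantQC` container and function names are hypothetical, and the sample rule follows the wording of the text (both principal component conditions must hold for exclusion).

```python
from dataclasses import dataclass

@dataclass
class VariantQC:
    """Per-variant summary statistics (hypothetical container)."""
    call_rate: float  # fraction of samples with a non-missing genotype
    hwe_p: float      # Hardy-Weinberg equilibrium test p-value
    maf: float        # minor allele frequency

def variant_passes(v: VariantQC) -> bool:
    """Thresholds from the text: call rate > 95%, HWE p > 1e-8, MAF > 0.5%."""
    return v.call_rate > 0.95 and v.hwe_p > 1e-8 and v.maf > 0.005

def sample_excluded(pc1: float, pc2: float) -> bool:
    """Exclusion rule as worded in the text: PC1 > 0.1 and PC2 > 0.2."""
    return pc1 > 0.1 and pc2 > 0.2

if __name__ == "__main__":
    print(variant_passes(VariantQC(call_rate=0.99, hwe_p=0.3, maf=0.02)))
    print(sample_excluded(pc1=0.15, pc2=0.25))
```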

Genetic Epidemiology Research on Aging (GERA)

The GERA cohort includes over 100,000 adults who are members of the Kaiser Permanente Medical Care Plan, Northern California Region (KPNC) and consented to research on the genetic and environmental factors that affect health and disease, linking together clinical data from electronic health records, survey data on demographic and behavioral factors, and environmental data with genetic data. The GERA cohort was formed by including all self-reported racial and ethnic minority participants with saliva samples (19%); the remaining participants were drawn sequentially and randomly from non-Hispanic White participants (81%). Genotyping was completed as previously described [48] using 4 different custom Affymetrix Axiom arrays with ethnic-specific content to increase genomic coverage. Principal components analysis was used to characterize genetic structure in this multi-ethnic sample, as previously described [49]. Blood cell traits were extracted from medical records. In individuals with multiple measurements, the first visit with complete white blood cell differential (if any) was used for each participant. Otherwise, the first visit was used. In total, 5,783 Hispanic/Latino and 2,246 AA participants with blood cell traits were included in the analysis.
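The rule for choosing a single measurement per GERA participant (first visit with a complete white blood cell differential, otherwise the first visit) can be sketched as follows. The `Visit` record and its fields are hypothetical stand-ins for the electronic health record data, which are not described in further detail here.

```python
from typing import NamedTuple, Optional, Sequence

class Visit(NamedTuple):
    """Hypothetical record of one blood draw."""
    date: str                     # ISO date, so string order is chronological
    has_full_differential: bool   # True if all WBC subtypes were measured

def select_visit(visits: Sequence[Visit]) -> Optional[Visit]:
    """Return the earliest visit with a complete differential, else the earliest visit."""
    ordered = sorted(visits, key=lambda v: v.date)
    for visit in ordered:
        if visit.has_full_differential:
            return visit
    return ordered[0] if ordered else None

if __name__ == "__main__":
    history = [
        Visit("2009-05-01", True),
        Visit("2007-03-12", False),
        Visit("2011-01-20", True),
    ]
    print(select_visit(history))
```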

Jackson Heart Study (JHS)

JHS is a population-based study designed to investigate risk factors for cardiovascular disease in African Americans. JHS recruited 5,306 AA participants age 35–84 from urban and rural areas of the three counties (Hinds, Madison and Rankin) that comprise the Jackson, Mississippi metropolitan area from 2000–2004, including a nested family cohort (≥ 21 years old) and some prior participants from the Atherosclerosis Risk in Communities (ARIC) study [50, 51]. Genotyping was performed using an Affymetrix 6.0 array through NHLBI’s Candidate Gene Association Resource (CARe) consortium [52] in 3,029 individuals, with quality control described previously [53]. Due to the greater JHS sample size in TOPMed freeze 5b (n = 3,082), we extracted SNPs genotyped on Affymetrix 6.0 and which passed CARe consortium quality control in the non-duplicated JHS TOPMed sequenced samples included in the imputation reference panel (821,172 variants which passed TOPMed quality controls used for imputation).

Coronary Artery Risk Development in Young Adults (CARDIA)

The CARDIA study is a longitudinal study of cardiovascular disease risk initiated in 1985–86 in 5,115 AA and European ancestry men and women, then aged 18–30 years. The CARDIA sample was recruited at four sites: Birmingham, AL, Chicago, IL, Minneapolis, MN, and Oakland, CA [54, 55]. Similar to JHS, genotyping was performed through the CARe consortium [52, 53] using an Affymetrix 6.0 array. In total, 1,619 AA participants with blood cell traits were included in the analysis.

Atherosclerosis Risk in Communities (ARIC)

The ARIC study was initiated in 1987, recruiting participants aged 45–64 years from 4 field centers (Forsyth County, NC; Jackson, MS; northwestern suburbs of Minneapolis, MN; Washington County, MD) in order to study cardiovascular disease and its risk factors [56], including the participants of self-reported AA ancestry included here. Standardized physical examinations and interviewer-administered questionnaires were conducted at baseline (1987–89), three triennial follow-up examinations, a fifth examination in 2011–13, and a sixth exam in 2016–2017. Genotyping was performed through the CARe consortium Affymetrix 6.0 array [52, 53]. In total, 2,392 AA participants with blood cell traits were included in the analysis.

Imputation and post-imputation quality filtering

We first phased individuals from each cohort separately using EAGLE [57] with default settings. We subsequently performed haplotype-based imputation using minimac4 [58], with phased haplotypes from TOPMed freeze 5b as reference. We used 100,506 TOPMed freeze 5b whole genome sequences as reference for all cohorts except JHS, for which we used 94,342 TOPMed freeze 5b non-JHS sequences. We additionally imputed HCHS/SOL and JHS using 1000 Genomes Phase 3 [9] and HRC [8] reference panels. Post-imputation quality filtering was performed using an R2 threshold specific to each MAF category to ensure average R2 for variants passing threshold was at least 0.8, following our previous work [4, 59]. Restricting to variants passing post-imputation quality control in at least two cohorts resulted in 34.4–35.8 million variants assessed in the AA cohorts and 26.7–27.2 million assessed in the Hispanic/Latino cohorts, depending on the exact sample size of the tested trait. Imputation and association analysis included autosomal variants only. We assessed imputation quality (comparing true and estimated average R2) in three selected 3Mb regions: the 16-19Mb region (relative to the start of each chromosome) from chromosomes 3, 12, and 20. Example scripts for imputation quality control are available at https://yunliweb.its.unc.edu/topmed5bimputation/index.php.
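One way to derive a MAF-category-specific R2 threshold of the kind described above is sketched below. This is an assumption about how such a threshold could be chosen, not the authors' script (their example scripts are at the URL above): within each MAF category, it picks the lowest cutoff such that the variants at or above it still average an estimated R2 of at least 0.8. The bin edges in `thresholds_by_bin` are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

def r2_threshold(r2_values: Iterable[float], target_mean: float = 0.8) -> Optional[float]:
    """Lowest cutoff t such that mean(R2 | R2 >= t) >= target_mean, or None."""
    ordered = sorted(r2_values, reverse=True)
    best: Optional[float] = None
    running = 0.0
    for i, r2 in enumerate(ordered):
        running += r2
        # Only evaluate at the end of a run of ties, so that "R2 >= cutoff"
        # selects exactly the variants summed so far.
        end_of_tie = i + 1 == len(ordered) or ordered[i + 1] < r2
        if end_of_tie and running / (i + 1) >= target_mean:
            best = r2
    return best

def thresholds_by_bin(
    variants: Iterable[Tuple[float, float]],        # (MAF, estimated R2) pairs
    edges: Sequence[float] = (0.0005, 0.005, 0.05),  # hypothetical MAF bin edges
) -> Dict[int, Optional[float]]:
    """Compute one R2 cutoff per MAF bin; bins are indexed by position among edges."""
    bins: Dict[int, List[float]] = {}
    for maf, r2 in variants:
        bins.setdefault(bisect_right(edges, maf), []).append(r2)
    return {b: r2_threshold(vals) for b, vals in bins.items()}

if __name__ == "__main__":
    demo = [(0.001, 0.9), (0.001, 0.6), (0.2, 0.95), (0.2, 0.7)]
    print(thresholds_by_bin(demo))
```

Because the running mean of values sorted in descending order never increases, the last cutoff that satisfies the target is also the most permissive one.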

Hematological traits

HGB, HCT, WBC and differential were measured in both the discovery data sets (S9 and S16 Tables) and a subset of the TOPMed freeze 5b samples (S10 and S17 Tables) using automated clinical hematology analyzers. Prior to association analyses, we excluded extreme outlier values, notably WBC values >200 × 10^9/L (as well as WBC subtype count values in these individuals), HCT >60%, and HGB >20 g/dL. For longitudinal cohort studies, all values are from the same exam cycle, chosen based on largest available sample size. WBC traits were log transformed due to their skewed distribution. For all traits, we first derived trait residuals adjusting for age, age squared, sex, and principal components/study specific covariates as needed. Trait residuals were then inverse-normalized prior to analysis.
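The trait preparation steps above can be sketched as follows. Several details are assumptions not stated in the text: the natural logarithm is used for the WBC transform, the inverse-normal transform is rank-based with a (rank - 0.5)/n offset and average ranks for ties, and the covariate regression that produces the residuals is omitted (the input to `inverse_normalize` is assumed to already be residuals).

```python
import math
from statistics import NormalDist
from typing import List, Optional, Sequence

# Upper limits from the text; WBC in 10^9 cells/L, HCT in %, HGB in g/dL.
UPPER_LIMITS = {"WBC": 200.0, "HCT": 60.0, "HGB": 20.0}

def clean_value(trait: str, value: float) -> Optional[float]:
    """Return None for extreme outliers, otherwise the value unchanged."""
    return None if value > UPPER_LIMITS[trait] else value

def log_transform_wbc(wbc: float) -> float:
    """Natural log transform (the log base is an assumption)."""
    return math.log(wbc)

def inverse_normalize(residuals: Sequence[float]) -> List[float]:
    """Rank-based inverse-normal transform with average ranks for ties."""
    n = len(residuals)
    order = sorted(range(n), key=lambda i: residuals[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and residuals[order[j + 1]] == residuals[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank across the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((r - 0.5) / n) for r in ranks]

if __name__ == "__main__":
    print(clean_value("HCT", 61.0))
    print(inverse_normalize([0.3, -1.2, 0.0, 2.5]))
```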

Association analysis in discovery cohorts

Association analyses were carried out for these variants via EPACTS for all cohorts except for HCHS/SOL, using the q.emmax test to account for relatedness within each cohort. Association tests were performed on inverse normalized residuals (adjusted for age, age squared, sex, and principal components/study specific covariates), further adjusting for kinship matrices constructed in EPACTS using variants with a MAF >1%. Individuals with different starting genotyping platform(s) were also analyzed separately. Inverse-variance weighted meta-analyses were then carried out using GWAMA [60], separately for AAs and Hispanics/Latinos.
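For readers unfamiliar with inverse-variance weighting, the calculation for a single variant can be sketched as below. This is a generic fixed-effect illustration, not GWAMA's code; the function name `ivw_meta` and the example effect sizes are hypothetical.

```python
import math
from statistics import NormalDist
from typing import Sequence, Tuple

def ivw_meta(betas: Sequence[float], ses: Sequence[float]) -> Tuple[float, float, float]:
    """Fixed-effect inverse-variance weighted meta-analysis for one variant.

    Returns the pooled effect, its standard error, and a two-sided p-value
    from a normal approximation.
    """
    weights = [1.0 / se ** 2 for se in ses]         # weight = 1 / variance
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return beta, se, p_value

if __name__ == "__main__":
    # Hypothetical per-cohort estimates for one variant.
    print(ivw_meta(betas=[0.10, 0.30], ses=[0.10, 0.10]))
```

Cohorts with smaller standard errors (typically larger sample sizes or better-imputed genotypes) contribute more to the pooled estimate.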

Identification and replication of novel associations

To identify putative novel associations, we then filtered out any variant with LD r2 ≥ 0.2 in any ethnic group with any previously reported variant from GWAS, sequencing, or Exome Chip analyses within ±1Mb for a given blood cell trait. We calculated LD in self-reported European ancestry, AA, and Hispanic/Latino individuals from TOPMed freeze 5b. For European and African LD reference panels, we further restricted to individuals with global ancestry estimate ≥0.8. The global ancestry estimates were derived from local ancestry estimates from RFMix [61] using data from the Human Genome Diversity Project (HGDP) [62] as the reference panel with seven populations, namely Sub-Saharan Africa, Central and South Asia, East Asia, Europe, Native America, Oceania, and West Asia and North Africa (Middle East). Global ancestry for each TOPMed individual is defined as the mean local ancestry across all HGDP SNPs. For replication of novel signals, similar to the approach we adopted for the discovery cohorts, we performed association analysis using EPACTS in each contributing cohort and then meta-analyzed with GWAMA.
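The two rules in this paragraph (global ancestry as the mean of local ancestry, and the LD-based novelty filter) can be sketched as follows. The data structures are hypothetical simplifications: local ancestry is given per SNP as the fraction of an individual's two haplotypes assigned to each population, pairwise r2 values are supplied as precomputed lookups per ancestry group, and all positions are assumed to lie on the same chromosome.

```python
from typing import Dict, Iterable, Mapping, Sequence, Tuple

def global_ancestry(local: Sequence[Mapping[str, float]]) -> Dict[str, float]:
    """Mean local ancestry fraction per population across all SNPs."""
    totals: Dict[str, float] = {}
    for snp in local:
        for population, fraction in snp.items():
            totals[population] = totals.get(population, 0.0) + fraction
    return {population: t / len(local) for population, t in totals.items()}

def in_ld_panel(ancestry: Mapping[str, float], population: str, cutoff: float = 0.8) -> bool:
    """Keep an individual in an LD reference panel if global ancestry >= cutoff."""
    return ancestry.get(population, 0.0) >= cutoff

def is_putatively_novel(
    pos: int,
    known_positions: Iterable[int],
    r2_by_group: Mapping[str, Mapping[Tuple[int, int], float]],
    window: int = 1_000_000,
    r2_cutoff: float = 0.2,
) -> bool:
    """False if any known variant within +/- window has r2 >= cutoff in any group."""
    for known in known_positions:
        if abs(known - pos) > window:
            continue
        for r2 in r2_by_group.values():
            if r2.get((pos, known), 0.0) >= r2_cutoff:
                return False
    return True

if __name__ == "__main__":
    anc = global_ancestry([{"AFR": 1.0, "EUR": 0.0}, {"AFR": 0.5, "EUR": 0.5}])
    print(anc, in_ld_panel(anc, "AFR"))
```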

Comparison of imputation reference panels, for variants with MAF > 1%.

Comparison of well imputed variants included in results from TOPMed (TOPMed freeze 5b), 1000G (the 1000 Genomes Phase 3), and HRC (the Haplotype Reference Consortium).