
Whole genome sequencing and imputation in isolated populations identify genetic associations with medically-relevant complex traits.

Lorraine Southam^1,2, Arthur Gilly¹, Dániel Süveges¹, Aliki-Eleni Farmaki³, Jeremy Schwartzentruber¹, Ioanna Tachmazidou¹, Angela Matchan¹, Nigel W Rayner^1,2,4, Emmanouil Tsafantakis⁵, Maria Karaleftheri⁶, Yali Xue¹, George Dedoussis³, Eleftheria Zeggini¹.

Abstract

Next-generation association studies can be empowered by sequence-based imputation and by studying founder populations. Here we report ∼9.5 million variants from whole-genome sequencing (WGS) of a Cretan isolated population, and show enrichment of rare and low-frequency variants with predicted functional consequences. We use a WGS-based imputation approach utilizing 10,422 reference haplotypes to perform genome-wide association analyses and observe 17 genome-wide significant, independent signals, including replicating evidence for association at eight novel low-frequency variant signals. Two novel cardiometabolic associations are at lead variants unique to the founder population sequences: chr16:70790626 (high-density lipoprotein levels beta −1.71 (SE 0.25), P=1.57 × 10−11, effect allele frequency (EAF) 0.006); and rs145556679 (triglyceride levels beta −1.13 (SE 0.17), P=2.53 × 10−11, EAF 0.013). Our findings add empirical support to the contribution of low-frequency variants in complex traits, demonstrate the advantage of including population-specific sequences in imputation panels and exemplify the power gains afforded by population isolates.


Year: 2017 PMID: 28548082 PMCID: PMC5458552 DOI: 10.1038/ncomms15606

Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919

Genome-wide association studies (GWAS) for complex medical traits have to date been designed and powered for the discovery of common-frequency variants with small to modest effect sizes. It is becoming increasingly clear that rare and low-frequency variants also play an important role1. Utilizing a large reference panel can greatly improve GWAS imputation accuracy2, capturing a slice of the previously unattainable allelic architecture. Isolated populations can additionally help expedite the identification of low-frequency variants affecting complex traits. The founding event can lead to an increase in allele frequency due to genetic drift, thereby boosting power for GWAS. Here, we use GWAS and exome chip data from the Mylopotamos (MANOLIS) and Pomak villages cohorts as a scaffold (Fig. 1), and impute up to a large reference panel of 5,122 individuals, including 249 MANOLIS samples sequenced at 4 × depth. This is the first time WGS data have been generated in this population. We examine 13,541,454 and 15,514,754 single nucleotide variants (SNVs) with minor allele count (MAC)≥2 in the Pomak and MANOLIS cohorts, respectively, and test for association with traits of cardiometabolic relevance. To enable meta-analysis across potentially related individuals, we implement a method that accounts for non-independence across strata and demonstrate its robustness. We identify eight novel signals for traits of medical relevance.

Figure 1

Flowchart of study design.

The HELIC cohorts were prephased, imputed and analysed separately by cohort and array, and finally meta-analysed. The variant numbers reported here are total regardless of MAF. Imputed variants are for chromosomes 1–22.

Results

Genetic architecture of Cretan population

We generated whole-genome sequence data at 4 × depth in 249 MANOLIS individuals selected on the basis of genome-wide genotype data to maximize haplotype diversity in the population. To characterize the variation landscape in this isolated population, we aggregated the proportion of SNVs captured across the genome (total n=9,554,503 with MAC≥2) by functional class (Fig. 2a and Supplementary Table 1) and found that variant densities are inversely correlated with ascribed functional importance. Highest densities are observed in intergenic regions, while coding and splice regions, where disrupting variants may have more severe consequences, are sparsest, in line with observations in other populations1345. As expected, we also find that variants with more severe consequences are present in a higher proportion at the lower end of the minor allele frequency (MAF) spectrum compared to the genomic average (Supplementary Fig. 1 and Supplementary Table 2).

Figure 2

Variant sharing and functional annotation.

(a) SNP density per kbp and percentage of total per functional class, based on 9,554,503 variants identified in the HELIC MANOLIS 4 × WGS data of 249 samples (MAC≥2). Error bars indicate standard error of the mean; the dashed red line indicates average density genome-wide. (b) Variant overlap between 498 HELIC MANOLIS, 7,582 UK10K and 2,184 1000 Genomes Project reference panel haplotypes, by MAF category. Numerical values are given in Supplementary Tables 1 and 2.

Of all autosomal SNVs found in the MANOLIS 4 × WGS data, 0.52 million (5.81%) were unique compared to the UK10K and 1000 Genomes Project reference panels (Fig. 2b and Supplementary Table 3). Most variants unique to MANOLIS were low-frequency or rare; in fact, the rarer a variant was, the more likely it was to be unique to MANOLIS, with 32% of doubletons being unique. To explore the functionality of these variants, we compared, for each MAF bin, the proportions of unique and shared variants with any given functional consequence, and found that rare variants are more likely to be unique to MANOLIS if they belong to a severe functional class. In particular, we find significant (P<1.00 × 10−10) enrichment of rare and low-frequency (MAF≤5%) coding and regulatory region variants (Fig. 3 and Supplementary Table 4), which is expected when comparing shared, older variants with newer, cohort-specific ones that have not yet fully undergone purifying selection. We also find a significant but modest depletion (P<1.00 × 10−5) of variants annotated as intergenic and upstream/downstream.

Figure 3

Functional enrichment of variants private to the MANOLIS sequences when compared to variants shared with UK10K and/or 1000 Genomes.

Enrichment and depletion of functional classes of variants private to the MANOLIS cohort can be observed among rare and low-frequency (MAF≤5%) variants, while no significant enrichment is detected among common-frequency variants in any functional class. Numerical values are listed in Supplementary Table 4.

Meta-analysis using METACARPA

The MANOLIS (n=1,476) and Pomak (n=1,737) cohorts were each genotyped in two tranches (Fig. 1), leading to a requirement for within-cohort meta-analysis. Existing methods to correct for sample relatedness (which is a distinct possibility when meta-analysing within an isolated population) or overlap are based on Pearson's correlation of the z-scores6, but this can lead to overcorrection in the presence of a large polygenic burden7. On the other hand, a meta-analysis method that uses tetrachoric correlation7 and combines P value-based with effect-size-based methods8 can account for non-independence of samples across strata. We implemented the method in openly available new software, METACARPA (https://github.com/wtsi-team144/metacarpa and http://www.sanger.ac.uk/science/tools/metacarpa). Using simulation, we show that under typical levels of sample overlap (up to 10%) METACARPA reduces false-positive rate inflation by 8%, while conserving power (Fig. 4a,b and Methods). We infer a tetrachoric correlation of 1.96% between P values in the two MANOLIS datasets, and a correlation of 1.84% between the two Pomak datasets. Those values reflect an average within-cohort, cross-dataset kinship of 0.43% and 0.33%, respectively, as measured by pi-hat.
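The idea of a correlation-aware meta-analysis can be illustrated with a minimal Python sketch. This is not METACARPA's code: the function names, the dichotomisation of z-scores by sign, the continuity correction and the equal default weights are all assumptions made here for illustration. The first function applies Digby's odds-ratio approximation to the tetrachoric correlation; the second combines two z-scores while allowing for a correlation rho between the studies.

```python
import math


def digby_tetrachoric(z1, z2):
    """Approximate tetrachoric correlation between two lists of z-scores.

    Each z-score is dichotomised by its sign (an assumption of this sketch),
    a 2x2 table is built, and Digby's approximation
    r = (OR**0.75 - 1) / (OR**0.75 + 1) is applied to its odds ratio.
    """
    # Start every cell at 0.5 so that an empty cell cannot cause division by zero.
    a = b = c = d = 0.5
    for x, y in zip(z1, z2):
        if x >= 0 and y >= 0:
            a += 1
        elif x >= 0:
            b += 1
        elif y >= 0:
            c += 1
        else:
            d += 1
    t = ((a * d) / (b * c)) ** 0.75
    return (t - 1.0) / (t + 1.0)


def combine_z(z1, z2, rho, w1=1.0, w2=1.0):
    """Weighted z-score combination of two studies with correlation rho.

    The denominator is the standard deviation of w1*Z1 + w2*Z2 when Z1 and Z2
    are standard normal with correlation rho.
    """
    return (w1 * z1 + w2 * z2) / math.sqrt(w1 ** 2 + w2 ** 2 + 2.0 * w1 * w2 * rho)
```

With rho set to 0 the second function reduces to an ordinary weighted z-score combination; a positive rho shrinks the combined statistic, which is how non-independence between strata is penalised in this sketch.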

Figure 4

False-positive rate and meta-analysis power in the presence of sample overlap using METACARPA.

(a) Empirical false-positive rate as a function of sample overlap in 1,000 repeats of a meta-analysis of two studies including 2,000 samples each, at a significance threshold of 5 × 10−8. (b) Empirical power of the four tests implemented in METACARPA as a function of sample overlap in the same simulation setting. Power is calculated as the discovery rate of a SNP explaining 1% of a standard normal phenotype under the same simulation scenario (for example, a MAF of 1% and an effect size of 0.705, or a MAF of 20% and an effect size of 0.176). (c) Compared accuracy of Digby's estimate of tetrachoric correlation and Pearson's correlation for a true (dashed line) 25% overlap under a polygenic burden, with 10,000 SNPs affecting a quantitative trait with 20% heritability. Estimates of correlation for both methods are calculated over 300 genome-wide simulations. The black line indicates the median, shaded rectangles represent the interquartile ranges.
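The two MAF and effect-size pairs quoted in the caption can be checked with a short Python sketch. It assumes the usual additive-model expression for the variance explained by a single SNP in a standardised phenotype, 2 × MAF × (1 − MAF) × beta², with genotypes in Hardy-Weinberg proportions; the caption does not state the formula, so this is an assumption of the sketch and the function name is illustrative.

```python
def variance_explained(maf, beta):
    """Fraction of the variance of a standardised phenotype explained by one
    SNP, assuming an additive model and Hardy-Weinberg genotype frequencies."""
    return 2.0 * maf * (1.0 - maf) * beta ** 2


# The two example settings given in the caption.
for maf, beta in [(0.01, 0.705), (0.20, 0.176)]:
    print(f"MAF={maf:.2f}, beta={beta:.3f}: {variance_explained(maf, beta):.4f}")
```

Both settings give a value close to 0.01, in agreement with the 1% of phenotypic variance stated in the caption.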

We also compared METACARPA to a genotype-level mega-analysis (Fig. 4a and b, Supplementary Fig. 2). When individual level data are available, a global analysis that takes dataset provenance into account and where overlapping samples are removed maintains the type-I error rate at nominal significance. The power of such a global mega-analysis drops markedly as sample overlap increases, although it is more powerful than summary-statistic level meta-analyses when no or little overlap is present. When only summary-level statistics are available, METACARPA provides the advantage of a lower false-positive rate than a naïve meta-analysis under typical levels of overlap (0–10%), although it does not control type-I error to nominal levels. Meanwhile, power is conserved compared to the naïve meta-analysis, and is higher than for a sample-level global analysis. As expected, the tetrachoric estimate of overlap is more robust than Pearson's correlation to an excess of signal in the meta-analysed studies (Fig. 4c). Furthermore, for the HELIC MANOLIS data, we compared the results produced by METACARPA to a mega-analysis of the genotype-level data, as well as a summary-level meta-analysis not accounting for relatedness using the GWAMA software, and found similar median statistics (λ=0.985±0.015) for association with high-density lipoprotein (HDL) (Supplementary Fig. 3). We conclude that all three meta-analysis methods were robust to the moderate levels of relatedness observed between the datasets of the HELIC study.
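Assuming that λ above denotes the usual genomic-control inflation factor (the median observed 1-d.f. chi-square statistic divided by its expected median under the null), it could be computed from a set of association P values roughly as follows. This is a sketch with an illustrative function name, not the procedure used in the study.

```python
from statistics import NormalDist, median

_ND = NormalDist()

# Median of a chi-square distribution with 1 degree of freedom (about 0.455).
CHI2_1DF_MEDIAN = _ND.inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Genomic inflation factor from two-sided P values, each in (0, 1]."""
    # A two-sided P value p corresponds to a chi-square statistic z**2,
    # where z is the standard normal quantile at p / 2.
    chi2 = [_ND.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN
```

Values near 1 indicate that the bulk of the test statistics follows the null distribution; values clearly above 1 suggest inflation, for example from unmodelled relatedness.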

Signals associated with traits of medical importance

We investigated 13 cardiometabolic, 9 anthropometric and 9 haematological traits of medical relevance, and report here genome-wide significant signals (P≤5.00 × 10−8) that replicate within (nominal significance and the same direction of effect for each array in a cohort) or across the isolates studied (nominal significance and the same direction of effect in MANOLIS and Pomak). We identify 9 previously-reported GWAS signals (Table 1 and Supplementary Note 1) and 8 novel, internally replicating associations (Table 2, Fig. 5 and Supplementary Fig. 4), which all validate when directly genotyped in the same samples using a different genotyping assay (Supplementary Table 5). None of these novel variants are present in the HapMap haplotypes (http://hapmap.ncbi.nlm.nih.gov), 5 do not have HapMap proxies (r2>0.8), and 3 are not present in the 1000 Genomes Project reference panel haplotypes. Three signals were identified in MANOLIS, four in Pomak and one across both isolated populations. Five signals fall just above our Bonferroni-adjusted genome-wide significance threshold (P<3.33 × 10−9) for the effective number of traits tested and are therefore considered tentative. When assessing the fine-mapping potential of these data (Supplementary Methods), we find that the median number of variants in the 95% credible sets is 6.5 and that the median interval length is 546 kbp. This is in line with the expectation of extended LD in founder populations.
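The relationship between the two significance thresholds, and the count of tentative signals, can be reproduced with a few lines of Python. The P values are the overall values listed in Table 2 (for rs145556679 the triglycerides value is used). Dividing the conventional threshold by the adjusted one suggests roughly 15 effective tests; that figure is inferred here from the two thresholds and is not stated in the text.

```python
GWS_THRESHOLD = 5.0e-8        # conventional genome-wide significance threshold
ADJUSTED_THRESHOLD = 3.33e-9  # Bonferroni-adjusted threshold quoted in the text

# Overall P values of the eight novel signals, copied from Table 2.
NOVEL_SIGNALS = {
    "chr16:70790626": 1.57e-11,
    "rs145556679": 2.53e-11,
    "rs140087759": 1.35e-8,
    "rs13382259": 3.18e-8,
    "rs6131100": 1.21e-8,
    "rs79748197": 3.00e-8,
    "rs557129696": 4.83e-11,
    "rs112037309": 2.70e-8,
}

# Number of effective tests implied by the ratio of the two thresholds.
implied_tests = GWS_THRESHOLD / ADJUSTED_THRESHOLD

# Signals that are genome-wide significant but miss the adjusted threshold.
tentative = sorted(
    variant for variant, p in NOVEL_SIGNALS.items()
    if ADJUSTED_THRESHOLD <= p <= GWS_THRESHOLD
)

print(f"implied effective tests: {implied_tests:.1f}")
print(f"tentative signals: {len(tentative)}")
```

Five of the eight variants fall between the two thresholds, matching the number of signals described as tentative.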

Table 1

Summary statistics at established loci.

Variant and cohorts	Trait	Chr:pos (EA/NEA)	Variant consequence	EAF	Beta (SE)	P-value	N	Reported variant	Reported genes	Reported PMID	Conditional P
rs7412 MANOLIS & Pomak	LDL	19:45412079 (T/C)	Missense	0.079	−0.419 (0.047)	2.64 × 10⁻¹⁹	3168	rs7412	APOC1, APOC2, APOE	22286219	NA
	TC		p.Arg176Cys	0.079	−0.27 (0.047)	1.05 × 10⁻⁸	3170

rs7553007 MANOLIS & Pomak	CRP	1:159698549 (A/G)	Intergenic	0.327	−0.202 (0.029)	6.80 × 10⁻¹²	2689	rs7553007	CRP	19567438	NA

rs964184 MANOLIS & Pomak	VLDL	11:116648917 (G/C)	3′ UTR	0.163	0.242 (0.035)	3.68 × 10⁻¹²	3170	rs964184	APOA1	24097068	NA
	TG			0.163	0.236 (0.035)	1.52 × 10⁻¹¹	3164

rs76353203 MANOLIS	TG	11:116701353 (T/C)	Stop-gain	0.022	−1.073 (0.129)	6.88 × 10⁻¹⁷	1461	rs76353203	APOC3	24343240	NA
	HDL		p.Arg19Ter	0.022	0.919 (0.13)	1.78 × 10⁻¹²	1465

rs150641967 MANOLIS & Pomak	LDL	19:19370340 (T/TGACA)	Intronic	0.075	−0.326 (0.049)	3.49 × 10⁻¹¹	3168	rs10401969	CILP2	24097068	9.34 × 10⁻¹
	TC			0.074	−0.322 (0.046)	8.29 × 10⁻¹¹	3170				8.71 × 10⁻¹
	TG			0.074	−0.278 (0.05)	2.49 × 10⁻⁸	3164				3.94 × 10⁻¹
	VLDL			0.075	−0.282 (0.05)	1.48 × 10⁻⁸	3170				3.51 × 10⁻¹

rs35237252 MANOLIS & Pomak	HDL	8:19870271 (A/C)	Regulatory region	0.277	0.183 (0.029)	4.04 × 10⁻¹⁰	3172	rs2083637	LPL	19060911	1.39 × 10⁻¹

rs200751500 MANOLIS & Pomak	HDL	16:57001274 (A/AC)	Intronic	0.33	0.294 (0.028)	4.02 × 10⁻²⁵	3172	rs1532624	CETP	19060911	1.18 × 10⁻⁴

rs1331309 MANOLIS & Pomak	MCH	6:135406178 (G/T)	Intronic	0.228	0.201 (0.033)	1.90 × 10⁻⁹	2829	rs7775698	MYB, HBS1L	20139978	3.59 × 10⁻¹

rs9804550 Pomak	WBC	11:5186093 (T/C)	Intronic	0.051	0.52 (0.081)	1.10 × 10⁻¹⁰	1673	rs7116019	TRIM68	25373335	5.33 × 10⁻⁶
	MCH			0.053	−0.627 (0.079)	2.19 × 10⁻¹⁵	1647				1.43 × 10⁻²
	MCHC			0.054	0.894 (0.075)	8.46 × 10⁻³³	1669				1.46 × 10⁻⁴
	MCV			0.052	−1.071 (0.076)	1.57 × 10⁻⁴⁵	1658				2.71 × 10⁻⁵
	RBC			0.054	0.473 (0.077)	8.58 × 10⁻¹⁰	1718				3.56 × 10⁻²

Lead variants for validated, previously-reported association signals reaching P<5.00 × 10−8. Cohorts, cohorts from which the signal arose; Chr:pos, the chromosome and position in GRCh37/hg19 coordinates; Variant consequence, taken from Ensembl (http://www.ensembl.org); the Human Genome Variation Society variant nomenclature (http://www.HGVS.org/varnomen) is provided for exonic variants. The other abbreviations are: EA, effect allele; NEA, non-effect allele; EAF, effect allele frequency; P, the Wald test P-value from the association analysis using METACARPA; N, sample size; Reported variant, RS-id of the reported signal; Reported genes, the gene(s) in which the signal was reported; Reported PMID, PubMed ID for the reported GWAS signal; Conditional P, Wald test P from the association analysis using METACARPA of the variant after conditioning on the reported variant, confirming the signals are conditionally dependent; NA, indicates that conditional analysis is not applicable since the variant is the same as the reported variant; LDL, low-density lipoprotein cholesterol; TC, total cholesterol; CRP, C-reactive protein; VLDL, very low-density lipoprotein cholesterol; TG, triglycerides; HDL, high-density lipoprotein cholesterol; MCH, mean corpuscular haemoglobin; WBC, white blood cells; MCHC, mean corpuscular haemoglobin concentration; MCV, mean corpuscular volume; RBC, red blood cells.

Table 2

Summary of novel association signals.

Variant and cohorts	Trait	Chr:pos (EA/NEA)	Nearest gene	Replication cohort	Replication EAF	Replication beta (SE)	Replication P-value	Replication MAC (N)	Overall EAF	Overall beta (SE)	Overall P-value	Overall MAC (N)
chr16:70790626 MANOLIS	HDL	16:70790626 (T/C)	VAC14-AS1 VAC14	MANOLIS CoreExome	0.003	−1.885 (0.994)	5.76 × 10⁻²	1.26 (210)	0.006	−1.713 (0.254)	1.57 × 10⁻¹¹	20 (1476)
				MANOLIS OmniExome	0.007	−1.702 (0.263)	1.81 × 10⁻¹⁰	17.6 (1255)

rs145556679 MANOLIS	TG	11:117643264 (C/G)	DSCAML1	MANOLIS CoreExome	0.005	−1.293 (0.729)	7.85 × 10⁻²	2.09 (209)	0.013	−1.134 (0.17)	2.53 × 10⁻¹¹	49 (1476)
				MANOLIS OmniExome	0.014	−1.125 (0.175)	1.70 × 10⁻¹⁰	35.1 (1252)
	VLDL			MANOLIS CoreExome	0.005	−1.365 (0.727)	6.21 × 10⁻²	2.1 (210)	0.013	−1.131 (0.17)	2.90 × 10⁻¹¹
				MANOLIS OmniExome	0.014	−1.118 (0.175)	2.29 × 10⁻¹⁰	35.1 (1253)

rs140087759 MANOLIS	WHR	5:28292892 (T/C)	CTD-2061E9.1	MANOLIS CoreExome	0.015	1.676 (0.411)	5.92 × 10⁻⁵	6.12 (204)	0.01	1.189 (0.209)	1.35 × 10⁻⁸	31 (1476)
				MANOLIS OmniExome	0.009	1.02 (0.243)	2.90 × 10⁻⁵	18.8 (1047)

rs13382259* Pomak	DBP	2:113934176 (T/A)	PSD4	Pomak CoreExome	0.047	0.509 (0.126)	6.98 × 10⁻⁵	60.3 (641)	0.043	0.554 (0.1)	3.18 × 10⁻⁸	172 (1737)
				Pomak OmniExome	0.039	0.629 (0.164)	1.36 × 10⁻⁴	43 (551)

rs6131100* Pomak	FGBMIadj	20:10434530 (A/T)	SLX4IP	Pomak CoreExome	0.038	−0.573 (0.16)	3.62 × 10⁻⁴	43.2 (569)	0.037	−0.79 (0.139)	1.21 × 10⁻⁸	135 (1737)
				Pomak OmniExome	0.035	−1.454 (0.279)	7.12 × 10⁻⁷	12.2 (174)

rs79748197 Pomak	WBC	2:19430105 (G/A)	AC092594.1	Pomak CoreExome	0.004	−1.242 (0.403)	2.12 × 10⁻³	5.8 (725)	0.008	−1.156 (0.209)	3.00 × 10⁻⁸	31 (1737)
				Pomak OmniExome	0.004	−1.125 (0.243)	4.14 × 10⁻⁶	20.9 (948)

rs557129696 Pomak	HGB	11:5328683 (G/T)	HBG2 HBE1 AC104389.28	Pomak CoreExome	0.002	−1.95 (0.606)	1.36 × 10⁻³	2.87 (717)	0.004	−2.027 (0.308)	4.83 × 10⁻¹¹	13 (1737)
				Pomak OmniExome	0.005	−2.054 (0.358)	1.30 × 10⁻⁸	9.45 (945)

rs112037309* MANOLIS & Pomak	Weight	4:106617136 (A/G)	ARHGEF38 INTS12	MANOLIS	0.075	0.295 (0.078)	1.43 × 10⁻⁴	189.8 (1258)	0.075	0.287 (0.052)	2.70 × 10⁻⁸	485 (3213)
				Pomak	0.075	0.28 (0.07)	5.96 × 10⁻⁵	250.8 (1672)

All variants are intronic with the exception of rs140087759, which is intergenic; variant consequences are taken from Ensembl (http://www.ensembl.org). For the internal replication the software used was GEMMA, with the exception of rs112037309, for which METACARPA was used. Cohorts, cohorts from which the signal arose; Chr:pos, the chromosome and position in GRCh37/hg19 coordinates; EA, effect allele; NEA, non-effect allele; EAF, effect allele frequency; P-value, the likelihood ratio test P-value from GEMMA or Wald test P-value from METACARPA; MAC, minor allele count for samples in the analysis; Overall MAC, minor allele count for all samples in the cohorts from which the signal arose, established using the rounded imputed allele dosages from SNPTEST (https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html); N, sample size; HDL, high-density lipoprotein cholesterol; DBP, diastolic blood pressure; TG, triglycerides; VLDL, very low-density lipoprotein cholesterol; FGBMIadj, fasting glucose adjusted for body mass index; HGB, haemoglobin; WBC, white blood cells; WHR, waist-to-hip ratio.

*At least one proxy is present in the International HapMap project data (http://hapmap.ncbi.nlm.nih.gov). Proxies were determined using LD (r2>0.8 in the cohorts used for the meta-analysis) for each novel variant. If a proxy was in HapMap it also had high LD (r2>0.9) with the variant in the 1000 Genomes Project CEU population3. LocusZoom was used to create the regional plots (http://csg.sph.umich.edu/locuszoom/).

Figure 5

Association results for chr16:70790626 and rs145556679 and lipid levels.

(a) Heterozygotes for chr16:70790626 exhibit significantly lower HDL levels than homozygotes (Wald test METACARPA P=1.57 × 10−11). (b) Heterozygotes for rs145556679 exhibit significantly lower TG (Wald test METACARPA P=2.53 × 10−11) and VLDL (Wald test METACARPA P=2.90 × 10−11) levels than homozygotes. (c) Regional association plot for chr16:70790626. (d) To determine if the signals are detected without MANOLIS sequences in the reference panel, we conducted imputation using a combined UK10K+1000 Genomes reference panel; the regional plot shows that the chr16:70790626 signal is captured with a different lead variant and a decrease in significance. (e) Regional association plot for rs145556679. (f) Regional association plot for rs145556679 using a combined UK10K+1000 Genomes reference panel; the same signal is captured with a different lead variant and a decrease in association strength. LocusZoom was used to create the regional plots (http://csg.sph.umich.edu/locuszoom/).

We identify a new association with HDL cholesterol at chr16:70790626 (beta −1.71 (SE 0.25), P=1.57 × 10−11, effect allele frequency (EAF) 0.006) (Table 2 and Fig. 5). This variant is present in the MANOLIS sequences only. When MANOLIS sequences are not included in the reference panel, a reduced signal is observed at a different variant (Fig. 5c,d). This is the strongest new signal for HDL in MANOLIS and explains 3.24% of the phenotypic variance. Chr16:70790626 resides in intron 11 of the VAC14 gene. The encoded protein is involved in the regulation of phosphatidylinositol 3,5-bisphosphate levels and the biogenesis of endosome carrier vesicles910. In animal models, knocking out Vac14 causes death within 2 days of birth owing to the disruption of phosphatidylinositol metabolism11. Seventy per cent of chr16:70790626 carriers are from the Anogia village. The average relatedness (pairwise pi-hat) is significantly higher in carriers (empirical P=0.006 from 100,000 permutations), who are on average 11 years younger than non-carriers, P=6.00 × 10−3 (Supplementary Table 6).

We identify a cardioprotective signal (rs145556679, EAF 0.013), which is associated with decreased triglycerides (TG) (beta −1.13 (SE 0.17), P=2.53 × 10−11) and with very low-density lipoprotein cholesterol (VLDL) levels (beta −1.13 (SE 0.17), P=2.90 × 10−11) (Table 2 and Fig. 5). This variant is not seen in any other worldwide cohort in the 1000 Genomes Project except for a single heterozygote reported in the Toscani in Italia (TSI) samples (n=107, MAF=0.005) (Supplementary Table 7). However, as singletons were filtered out of the reference WGS data prior to phasing, rs145556679 is only represented in the MANOLIS sequences in the reference panel. Variants in LD with rs145556679 are present in haplotypes from other reference panel populations and a reduced signal for a different variant is detected when MANOLIS sequences are not included in the reference panel (Fig. 5e,f). This variant is the strongest new signal for TG/VLDL in MANOLIS and explains 3.21% and 3.20% of TG and VLDL variance, respectively. rs145556679 is located 942 kbp downstream of rs76353203 (APOC3 R19X, previously associated with lipid traits1213, r2=0.001). Conditional analysis confirms that rs145556679 is independent of R19X (TG, Pcond=1.09 × 10−12; VLDL, Pcond=1.22 × 10−12) (Supplementary Table 8). rs145556679 resides within an intron of the Down syndrome cell adhesion molecule like 1 (DSCAML1) gene, which is involved in cell adhesion in neuronal processes and is expressed in heart, liver, pancreas, skeletal muscle, kidney and brain1415. An independent variant in this gene (rs10892151, 112 kbp away from rs145556679, r2=0.0005 in MANOLIS) has previously been implicated in TG levels in the Amish founder population13.

In MANOLIS we also observe an association between waist-to-hip ratio (WHR) and rs140087759 (beta 1.19 (SE 0.21), P=1.35 × 10−8, EAF 0.010) (Table 2), located 5 kbp upstream of the long non-coding RNA gene CTD-2061E9.1. The signal is not associated with WHR in the Pomak population (P=0.39), and has a higher frequency in the Pomak (MAF 0.038) and 1000 Genomes Project EUR populations (MAF 0.014) compared to MANOLIS (MAF 0.01). rs140087759 has no proxies with r2>0.8 in MANOLIS and is not present in the WHR GWAS summary statistics from the Genetic Investigation of ANthropometric Traits (GIANT) study (https://www.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files)16.

In the Pomak population, we identify an association between diastolic blood pressure (DBP) and rs13382259 (beta 0.55 (SE 0.1), P=3.18 × 10−8, EAF 0.043) (Table 2), which resides in a predicted promoter (ENSR00000596922)17 in an intron of PSD4. It is located 1 kbp upstream of novel transcript AC016683.5. rs13382259 is associated with the expression of PAX8 in tibial nerve (GTEx Portal, http://www.gtexportal.org). The allele frequency of rs13382259 is lower in MANOLIS (MAF 0.024) compared with the 1000 Genomes Project EUR populations (MAF 0.05) and the Pomak population (MAF 0.05). The signal is not associated in MANOLIS (P=0.53) and is not present in the genome-wide summary statistics for the International Consortium for Blood Pressure (ICBP)18. Proxies for rs13382259 (r2>0.8) are present in the International HapMap Project data (http://hapmap.ncbi.nlm.nih.gov) and three are present in ICBP summary statistics but none were significantly associated with DBP.

We also identify an association between fasting glucose levels adjusted for BMI (FGBMIadj) and rs6131100 (beta −0.79 (SE 0.14), P=1.21 × 10−8, EAF 0.037) (Table 2). rs6131100 is situated in an intron of SLX4IP and 20 kbp upstream of MKKS, which is associated with Bardet–Biedl syndrome 6 (OMIM: 605552). The allele frequency of rs6131100 is higher in the MANOLIS (MAF 0.083) and 1000 Genomes Project EUR populations (MAF 0.053) compared to the Pomak population (MAF 0.039). rs6131100 is not associated with FGBMIadj in MANOLIS (P=0.91), and is not present in genome-wide summary data available from the Meta-Analyses of Glucose and Insulin-related traits Consortium (MAGIC) study (www.magicinvestigators.org)192021. One proxy for rs6131100 was present in the International HapMap Project but this did not show evidence of association in the MAGIC genome-wide summary data for FGBMIadj.

In the Pomak cohort, we also observe an association between white blood cell count (WBC) and rs79748197 (beta −1.16 (SE 0.21), P=3.00 × 10−8, EAF 0.008) (Table 2), which resides in an intron of a non-coding transcript (AC092594.1). The closest protein-coding gene is OSR1, 121 kbp away, a widely-expressed transcription factor implicated in embryonic heart, kidney and urogenital development22. rs79748197 has a similar frequency in MANOLIS, where it is not associated with WBC (P=0.19). It has a higher allele frequency in the 1000 Genomes Project EUR population (MAF 0.014). No proxies are present for rs79748197 in the Pomak population and this trait was not examined in the Haemgen RBC study23.

Lastly, rs557129696 is associated with haemoglobin levels (HGB) (beta −2.03 (SE 0.31), P=4.83 × 10−11, EAF 0.004) (Table 2). The variant resides in an extended LD region spanning 1.4 Mb. The signal is conditionally independent (Supplementary Table 8) of previously-reported blood trait GWAS signals in this region in the same population24. rs557129696 resides in the intronic regions of two haemoglobin-coding genes (HBE1 and HBG2) and a non-coding RNA gene (AC104389.28). The G-allele of rs557129696 is not seen in the 1000 Genomes EUR population. Numerous associations with red blood cell traits, anaemia and thalassemias have been linked to this chromosome 11 region25262728. We have previously observed an independent signal associated with blood traits in this chromosomal region of extended LD24. Notably, associations between variants in this region and foetal haemoglobin levels29 have been reported in the Sardinian founder population.

Weight was associated with rs112037309 (beta 0.29 (SE 0.05), P=2.70 × 10−8, EAF 0.075) in both isolated populations (Table 2). rs112037309 is situated in the intronic regions of ARHGEF38 and INTS12. The protein product of ARHGEF38 is involved in signal transduction, and immunohistochemistry shows strong staining in pancreatic islets, skeletal and smooth muscle (Protein atlas, http://www.proteinatlas.org). rs112037309 has a higher frequency in the 1000 Genomes Project EUR population (MAF 0.096) compared with the Pomak (MAF 0.073) and MANOLIS (MAF 0.074) populations. We were unable to look up this variant in large GWAS studies as weight is not one of the traits included as part of the Genetic Investigation of Anthropometric Traits (GIANT)16 study.

Discussion

We provide here a first characterization of the genetic architecture of the MANOLIS isolated population and report 9.5 million SNVs, of which 6% are absent from previously sequenced panels and are enriched for predicted functional consequences. Our complex trait association findings highlight the advantages of whole-genome sequencing in founder populations: the two lipid trait signals and the HGB signal we identify are driven by variants unique to the MANOLIS cohort or extremely rare in other worldwide populations. The remaining five novel associations are present in European populations (1000 Genomes Project EUR MAF ranging from 0.014 to 0.096) but are not significantly associated in GWAS meta-analyses of cosmopolitan populations. This can be due to a number of reasons in addition to winner's curse, that is, larger effect sizes in the discovery isolate cohort. For two of these signals, the variant and its proxies are not present in the HapMap reference panel and therefore these variants are not represented in GWAS conducted to date. Three of the associated variants are represented in HapMap and show no evidence of association outside the isolate; this can indicate that the index variant is in LD with the causal variant in the isolate but not in the cosmopolitan population. Furthermore, the effect and therefore the power to detect associations can be increased in isolates due to the environmental and phenotypic homogeneity when compared to other worldwide populations, in addition to extended LD. Our study demonstrates the power benefits of using a large, sequence-based imputation reference panel. Six of the eight new associations reported here have been detected for the first time as a consequence of improved imputation accuracy. The cost-effective hybrid WGS and imputation approach in founder populations serves as a good model for further low-frequency variant discovery, which can enhance our understanding of the biological processes underpinning complex traits of medical relevance.

Methods

Cohort descriptives

The HELIC (Hellenic Isolated Cohorts; www.helic.org) MANOLIS (Minoan Isolates) collection focuses on Anogia and surrounding Mylopotamos villages on the Greek island of Crete. All individuals had to have at least one parent from the Mylopotamos area. The HELIC Pomak collection focuses on the Pomak villages, a set of isolated mountainous villages in the North of Greece. Recruitment of both population-based samples was primarily carried out at the village medical centres. The study includes biological sample collection for DNA extraction and lab-based blood measurements, and interview-based questionnaire filling. The phenotypes collected include anthropometric and biometric measurements, clinical evaluation data, biochemical and haematological profiles, self-reported medical history, demographic, socioeconomic and lifestyle information. The study was approved by the Harokopio University Bioethics Committee and informed consent was obtained from every participant.

HELIC MANOLIS sequencing data

Sample selection. Overall, 250 HELIC MANOLIS samples were whole-genome sequenced at 4 × depth to provide reference haplotypes for imputation. To maximize haplotype diversity, the 250 most unrelated samples were selected from a set of 1,118 samples genotyped on the Illumina OmniExpress chip. Common (MAF>5%) variants were used to calculate the pairwise identity by descent; there were 624,403 such pairs for 1,118 individuals. Samples were selected based on a maximal of 0.15 across all samples. Sequencing and variant calling. Sequencing was performed at 4 × average depth using Illumina HiSeq 2000 sequencers. The data was aligned to the 1000 Genomes Project phase 1 reference assembly using BWA30. Optical and PCR duplicates were removed using Picard MarkDuplicates (http://broadinstitute.github.io/picard). Variants were called using samtools31 mpileup, and quality score recalibration was performed using the variant quality score recalibration (VQSR) tool from the GATK32 v.2.1.13 suite. After recalibration, one sample was found to be an ethnic outlier and was removed. No samples were excluded based on concordance checks with genotype data, sex checks, mean depth per sample, heterozygous or singleton rate per sample or non-reference allele (NREF) discordance. Variant QC and Haplotype creation. Post-VQSR, variants were filtered so as to yield a sensitivity threshold of 90% for INDELS (VQSLOD<3.1159) and a threshold of 94% for SNPs (VQSLOD<5.4079).Variants were excluded if they were multi-allelic, monomorphic, singletons, indels, had a missingness >3% or a HWE P<1.00 × 10−4. Any variant from the HELIC MANOLIS data for which the alleles differed from the 1000 Genomes Project and UK10K datasets at the same position was excluded. Phasing was performed using SHAPEIT v2.r727 (ref. 33). Following imputation, variants with IMPUTEv2 (ref. 34) info score <0.7 were filtered out. 
The final imputed variant set shows excellent genotype and minor allele concordance across the MAF spectrum compared to the array data (Supplementary Fig. 5). Average minor allele concordance was 94.6% for rare (MAF<1%) variants and 96.7% for low-frequency (MAF 1-5%) variants. SNP density inside low-complexity regions (LCR) in the hg19 build was 6.5 times lower than in the accessible genome (Supplementary Note 2 and Supplementary Table 9).
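
The pair count quoted in the sample selection step follows from n(n-1)/2, and picking a set of mutually unrelated samples under an IBD cut-off can be illustrated with a simple greedy filter. The sketch below is illustrative only: the function names, the toy IBD values and the greedy strategy are assumptions and not the procedure used by the authors.

```python
def n_pairs(n):
    """Number of unordered sample pairs among n individuals."""
    return n * (n - 1) // 2


def greedy_unrelated(samples, ibd, max_ibd=0.15):
    """Keep each sample whose IBD with every already-kept sample is <= max_ibd.

    `ibd` maps frozenset({a, b}) to a pairwise IBD estimate;
    pairs absent from the mapping are treated as unrelated (0.0).
    """
    kept = []
    for s in samples:
        if all(ibd.get(frozenset((s, k)), 0.0) <= max_ibd for k in kept):
            kept.append(s)
    return kept


# Toy (hypothetical) IBD estimates for three samples.
toy_ibd = {frozenset(("A", "B")): 0.50, frozenset(("B", "C")): 0.10}

print(n_pairs(1118))
print(greedy_unrelated(["A", "B", "C"], toy_ibd))
```

A greedy pass depends on sample order and does not guarantee the largest possible unrelated set; it is shown only to make the thresholding step concrete.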

Merged reference panel creation

A large reference panel was constructed by combining the WGS haplotypes from HELIC MANOLIS (9,554,503 variants and 249 samples), the 1000 Genomes Project (ref. 35; 27,449,245 variants and 1,092 samples) and UK10K (ref. 1; 25,109,897 variants and 3,781 samples). For the 1000 Genomes Project we used the Phase I integrated variant set release haplotypes (SHAPEIT2) in NCBI build 37 (hg19) coordinates, all ethnicities with singletons excluded (ALL.integrated_phase1_SHAPEIT_16-06-14.nosing), downloaded from the IMPUTEv2 (refs 33, 34) website (http://mathgen.stats.ox.ac.uk/impute/impute_v2.1.0.html). For UK10K the haplotypes were prepared as described previously (ref. 12). IMPUTEv2 (refs 33, 34) was used to merge the haplotypes in a two-step process: first merging the 1000 Genomes Project with the UK10K dataset, and second merging HELIC MANOLIS with the combined UK10K-1000 Genomes Project reference haplotypes. The merged reference panel contained 5,211 samples and 38,810,554 variants.

HELIC MANOLIS WGS SNV frequencies and functional annotation

The human genome was split into functional regions as follows: coding and UTR sequences were extracted from GENCODE annotations (Release 19, mapped to the GRCh37 build; ref. 36), while upstream/downstream regions, introns and splice regions were derived from GENCODE data following Sequence Ontology definitions (ref. 37). Intergenic regions were defined as regions of the ungapped human genome (Ensembl release 75, build GRCh37) that no GENCODE gene overlapped. Bedtools (ref. 38) was used to find the variants overlapping each genomic region, and the average frequency was then calculated (Fig. 2a). Using the Ensembl variant effect predictor (VEP) (http://www.ensembl.org, version 75, on build GRCh37; ref. 39), the most severe consequence term was assigned to each autosomal SNV in the phased and imputed dataset (the same dataset is used in all subsequent analyses). Consequences were pooled into eight consequence categories: 3′/5′ UTR, coding sequence variant, intergenic variant, intron variant, splice-region variant, non-coding transcript variant, upstream/downstream variant and regulatory variant (Supplementary Table 10). Variants were grouped into the following bins: MAC=2, MAC>2 and MAF≤1%, 1%<MAF≤5%, and MAF>5%. The percentage of variants with a given consequence term in each MAF bin was calculated (Supplementary Table 2 and Supplementary Fig. 1).
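
As an illustration of the frequency binning, the following minimal sketch assigns a variant to a bin from its minor allele count. It assumes the four bins are MAC=2, MAC>2 with MAF≤1%, 1%<MAF≤5% and MAF>5%, and that MAF is computed over 2n chromosomes for n diploid samples; the function name and bin labels are hypothetical.

```python
def frequency_bin(minor_allele_count, n_samples):
    """Assign a variant to a frequency bin from its minor allele count (MAC)."""
    if minor_allele_count < 2:
        # Monomorphic variants and singletons are assumed to be filtered upstream.
        raise ValueError("variants with MAC < 2 are not binned")
    maf = minor_allele_count / (2 * n_samples)
    if minor_allele_count == 2:
        return "MAC=2"
    if maf <= 0.01:
        return "MAC>2 and MAF<=1%"
    if maf <= 0.05:
        return "1%<MAF<=5%"
    return "MAF>5%"


# With 249 sequenced samples there are 498 chromosomes.
for mac in (2, 4, 5, 100):
    print(mac, frequency_bin(mac, 249))
```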

HELIC MANOLIS 4 × WGS reference panel comparisons

The autosomal SNVs in the 249 HELIC MANOLIS 4 × WGS reference haplotype dataset were binned according to the observed MAF. Each variant was checked to establish whether it was present in the UK10K (n=3,781) and/or the 1000 Genomes Project (n=1,092) reference haplotype dataset (Fig. 2).

Functional enrichment of variants private to HELIC MANOLIS

Variants discovered in the 4 × WGS were separated into two groups: those shared with the UK10K or the 1000 Genomes reference dataset (9,030,004 variants) and those unique to HELIC MANOLIS (524,499 variants). To compare the distributions of consequence terms in the shared and the unique datasets, a two-sided proportion test was used for each MAF bin, using the consequence and MAF bins described above. Fold enrichment was calculated with the following equation:

$$E_{c,M} = \frac{n^{u}_{c,M} / N^{u}_{M}}{n^{s}_{c,M} / N^{s}_{M}}$$

where $E_{c,M}$ is the fold enrichment of consequence $c$ in MAF bin $M$; $n^{u}_{c,M}$ and $n^{s}_{c,M}$ are the number of variants in MAF bin $M$ with consequence $c$ in the unique and the shared dataset, respectively; and $N^{u}_{M}$ and $N^{s}_{M}$ are the total number of variants in MAF bin $M$ in the unique and the shared dataset, respectively. Bonferroni correction of the P values was applied to account for multiple testing (Fig. 3 and Supplementary Table 4).
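
The enrichment comparison can be sketched as a ratio of two proportions plus a two-sided test of their difference. The example below is a minimal sketch under stated assumptions: fold enrichment is taken as the proportion in the unique dataset divided by the proportion in the shared dataset, the test is a pooled two-proportion z-test without continuity correction (the exact test variant used in the study is not specified here), and all counts and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def fold_enrichment(n_unique_c, n_unique_total, n_shared_c, n_shared_total):
    """Ratio of the proportion of consequence c in the unique vs shared dataset."""
    return (n_unique_c / n_unique_total) / (n_shared_c / n_shared_total)


def two_proportion_p(x1, n1, x2, n2):
    """Two-sided P value of a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * NormalDist().cdf(-abs(z))


def bonferroni(p, n_tests):
    """Bonferroni-adjusted P value, capped at 1."""
    return min(1.0, p * n_tests)


# Hypothetical counts for one consequence category in one MAF bin.
e = fold_enrichment(30, 1000, 200, 10000)
p = two_proportion_p(30, 1000, 200, 10000)
print(e, p, bonferroni(p, 8))
```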

Array genotyping and quality control

The MANOLIS and Pomak cohorts were each genotyped in two tranches: one on the Illumina HumanOmniExpress BeadChip and Illumina HumanExome BeadChip, and one on the Illumina HumanCoreExome beadchip (Illumina, San Diego, CA, USA) at the Wellcome Trust Sanger Institute, Hinxton, UK. The two datasets for each cohort were phased, imputed and analysed separately (Fig. 1). Quality control (QC) for the samples genotyped using the OmniExpress genotypes has been previously described24. The same samples were genotyped using the HumanExome BeadChipv1.1 at the Wellcome Trust Sanger Institute, Hinxton, UK and called with Illumina Genome Studio Gencall, and zCall40. The calling and QC were undertaken separately for the Pomak and MANOLIS cohorts using a step-wise QC approach which consisted of GenCall sample QC followed by zCall sample and variant QC (Supplementary Methods). The genotypes from the OmniExpress and HumanExome chips were merged into a single dataset. For variants present in OmniExpress and HumanExome the genotypes for those with MAF≥5% were taken from the OmniExpress while those with MAF<5% were taken from the HumanExome. This merged genotype, referred to as the ‘OmniExome' dataset, contained 1265 samples and 621,908 variants for the MANOLIS and 1003 samples and 612,403 variants for Pomak. Additional HELIC MANOLIS and Pomak samples were genotyped on the Illumina HumanCoreExome-12-v1.1 (Illumina) at the Wellcome Trust Sanger Institute, Hinxton, UK. Genotypes were called with GenCall and zCall (Supplementary Methods). In MANOLIS 211 samples and 529,604 variants, and in Pomak 734 samples and 529,086 variants passed QC.
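
The rule for resolving variants present on both arrays when building the 'OmniExome' dataset can be written as a small decision function. This is an illustrative sketch only; the function name and its arguments are hypothetical, and it encodes nothing beyond the MAF≥5% / MAF<5% rule stated above.

```python
def genotype_source(maf, on_omniexpress, on_humanexome):
    """Return which array supplies the genotypes for a variant in the merged set."""
    if on_omniexpress and on_humanexome:
        # Overlapping variants: common from OmniExpress, rarer from HumanExome.
        return "OmniExpress" if maf >= 0.05 else "HumanExome"
    if on_omniexpress:
        return "OmniExpress"
    if on_humanexome:
        return "HumanExome"
    raise ValueError("variant is not present on either array")


print(genotype_source(0.20, True, True))
print(genotype_source(0.01, True, True))
```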

Phasing and imputation

Each cohort and array was phased and imputed separately (Fig. 1). Before phasing, variants were excluded if they were duplicates, monomorphic or singletons, had poor intensity clustering, or had allelic differences between the array and the reference panel. Samples not genotyped on both the OmniExpress and Exome chip arrays were excluded, as well as variants with MAF<5% genotyped on the OmniExpress. Samples were phased using SHAPEIT v2.r778 (ref. 33) and imputed using IMPUTE v2.3.1 (refs 33, 34). Following imputation, any variant with HWE P<1.00 × 10⁻⁴ or imputation information score <0.4 was excluded. There was good genotype concordance between the imputed and WGS data for the 249 overlapping samples (Supplementary Note 3).

Phenotype preparation

Thirty-one phenotypes encompassing cardiometabolic, anthropometric and haematological traits were prepared separately for each cohort and array (Supplementary Table 11). If gender differences were significant (Wilcoxon rank sum, P<0.05), the phenotype was stratified accordingly. Following trait-specific exclusions and adjustments, outliers were filtered out based on 3, 4 or 5 SD away from the mean. Traits not normally distributed were transformed to normality using either an inverse normal or log transformation. For all traits, age and age² were added as covariates as necessary and standardised residuals were used. Some traits were adjusted for body mass index (BMI). If male and female phenotypes were prepared separately, these were standardised before combining the residuals. Summary statistics for all of the traits are provided in Supplementary Table 12.
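
Two of the preparation steps, SD-based outlier filtering and the inverse normal transformation, can be sketched with the Python standard library. This is a minimal illustration, not the study's pipeline: the function names are hypothetical, the rank offset (rank - 0.5)/n is one common convention chosen here as an assumption, and ties are not handled.

```python
from statistics import NormalDist, mean, stdev


def filter_outliers(values, n_sd=3):
    """Drop values lying more than n_sd sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= n_sd * s]


def inverse_normal(values):
    """Rank-based inverse normal transform (assumes all values are distinct)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    out = [0.0] * n
    for rank, i in enumerate(order, start=1):
        # Map the rank to a quantile strictly inside (0, 1), then to a z-score.
        out[i] = NormalDist().inv_cdf((rank - 0.5) / n)
    return out


print(filter_outliers([1, 2, 3, 4, 100], n_sd=1))
print(inverse_normal([10, 1, 100]))
```

The transform preserves the ordering of the input values while forcing the marginal distribution towards a standard normal, which is why it is applied before fitting the covariate model.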

Association analysis

GEMMA. Association analysis was performed separately for each cohort and array using the imputed genotypes. GEMMA (ref. 41) was used for the analysis. This software accounts for relatedness at the array level by using a linear mixed model. A centred kinship matrix was generated using only the directly typed array genotypes. P values from the likelihood ratio test (p_lrt) are reported. For meta-analysis within and across cohorts we developed and used METACARPA.

METACARPA. When meta-analysing GWAS, both the P values and effect sizes can be meta-analysed on a per-variant basis. Both meta-statistics are weighted sums; for example, for effect sizes:

$$\hat{\beta} = \sum_{k=1}^{K} w_k \beta_k$$

where $\hat{\beta}$ is the estimator of a common effect across all studies, $k$ identifies the study among the $K$ that should be meta-analysed, $\beta_k$ is the effect in study $k$ and $w_k$ is a study-specific weight. For P values, we transform each study's P value $p_k$ to a z-score $z_k$ using the inverse $\Phi^{-1}$ of $\Phi$, the cumulative distribution of the standard normal. Then:

$$z = \sum_{k=1}^{K} w_k z_k$$

Then, z-scores are transformed back to P with the complement of the previous transformation. In both cases, the variance needs to be derived. For both $\hat{\beta}$ and $z$ it has the typical form of a variance of weighted sums:

$$\mathrm{Var}(\hat{\beta}) = \sum_{k=1}^{K} w_k^2\,\mathrm{Var}(\beta_k) + 2\sum_{k<l} w_k w_l\,\mathrm{Cov}(\beta_k, \beta_l)$$

and likewise for $z$, with $z_k$ in place of $\beta_k$. $\mathrm{Var}(z_k)=1$ by construction and $\mathrm{Var}(\beta_k)$ is taken from the input files. The previous equations require the covariances of the individual study statistics across all pairs of studies. We build a K × K variance-covariance matrix describing this 'inter-study relatedness'. Lin and Sullivan (ref. 8) propose the following for estimating study correlation in quantitative trait GWAS:

$$r_{kl} = \frac{n_{kl}}{\sqrt{n_k n_l}}$$

which is the number of overlapping individuals $n_{kl}$ in relation to the studies' sample sizes $n_k$ and $n_l$. However, in many cases $n_{kl}$ is unknown, or the relatedness is subtler than a simple overlap. Province and Borecki (ref. 7) propose instead to estimate $r_{kl}$ as the tetrachoric correlation coefficient, $r_{\mathrm{tetrachoric}}$, of the binary-transformed P values of studies $k$ and $l$. We obtain covariances using $\mathrm{Cov}(\beta_k, \beta_l) = r_{kl}\,\sigma_k \sigma_l$, since $\sigma_k^2$ and $\sigma_l^2$, the variances of the statistic in each study, are known.

It is assumed that for every $(k,l)$, $\mathrm{cor}(z_k, z_l) = \mathrm{cor}(\beta_k, \beta_l) = r_{kl}$; that is, the general term for the variance-covariance matrix $\Omega_z$ for the P meta-analysis is $r_{kl}$, and the general term for the variance-covariance matrix $\Omega_\beta$ for the effect-size meta-analysis is $r_{kl}\,\sigma_k \sigma_l$. For weights, it is shown (ref. 8) that in the case of overlapping samples, the $w_k$ are of the form:

$$w = \frac{\Omega_\beta^{-1} e}{e^{T} \Omega_\beta^{-1} e}$$

where $e$ is the unity vector of size $K$ and $\Omega_\beta$ is the estimated covariance matrix of the effect sizes between studies, with general term $r_{kl}\,\sigma_k \sigma_l$. For the P meta-analysis, the general term of $\Omega_z$ does not contain a factor accounting for unequal sample size. A weight vector built from $s$, the vector containing the sample sizes of all studies, is therefore used; its general term is the relative sample size of study $i$. We implemented this method in C++ using the Boost libraries. For tetrachoric correlation, we use the approximation of Digby (ref. 42), which has been shown to be valid when analysing equilibrated 2 × 2 tables of large sample sizes, which is the case when binary-transforming GWAS P values. This result was confirmed by comparing the approximated value with an iterative maximum likelihood estimator.

Simulation and benchmark. This implementation was tested by repeatedly drawing two random sets of 2,000 samples each from the UKHLS GWAS dataset (EGA accession EGAD00010000890), with increasing sample overlap. Phenotypes were drawn from a standard normal. The two studies were associated separately using GEMMA (ref. 41), then meta-analysed using METACARPA, and the whole process was repeated 1,000 times for each level of overlap. An uncorrected fixed-effects, sample size-weighted P value-based meta-analysis (ref. 43) was implemented in the software for comparison, as well as an uncorrected inverse-variance weighted, effect size-based meta-analysis. We used degrees of overlap ranging from 0.5 to 75% of the total sample size (Fig. 4 and Supplementary Fig. 2).
We assessed the false-positive rate calculated at a genome-wide significance threshold of 5.00 × 10⁻⁸, and the power to detect a single associated SNP. Effect SNPs were chosen randomly for each simulation; MAF and effect sizes were constrained so that the effect SNP explained 1% of phenotype variance. For typical to substantial levels of overlap (0.5–10%), false-positive rate grows linearly for both the two uncorrected and the two corrected methods (Fig. 4a). However, for the latter, the growth rate is reduced from 6 × 10⁻⁵%/sample to 5.5 × 10⁻⁵%/sample (a reduction of 8.3%). While for typical (0.5–5%) levels of overlap, power to detect a single SNP is conserved, for substantial levels of overlap (5–10%) it drops at an approximate rate of 0.05%/sample. For extensive levels of overlap (10–75%), the increase in false-positive rate slows further and stabilizes around 9% for overlaps greater than 50% for both corrected methods (Supplementary Fig. 2), whereas uncorrected methods keep growing at an unchanged rate. Owing to the reduction in effective sample size, power decreases to below 60% for very high levels of overlap. At the levels of overlap inferred in the HELIC datasets (1.96 and 1.84%), power is decreased by 0.1% and false-positive rate is decreased by 0.2% between the corrected and uncorrected effect-size based meta-analyses. We evaluated the accuracy of tetrachoric correlation in estimating the true simulated sample overlap (Supplementary Fig. 6) compared to Pearson's correlation of z-scores. Although both methods systematically underestimated sample overlap, tetrachoric correlation performed poorly compared to Pearson's when all SNPs were under the null (Supplementary Fig. 6). Tetrachoric correlation's main advantage is to ignore outliers, hence it may be overconservative under the null.
Under a simulated polygenic burden across 10,000 SNPs for a trait that is 20% heritable under 25% sample overlap, both methods overestimated correlation but tetrachoric correlation was more accurate than Pearson's (Fig. 4c). This suggests that tetrachoric correlation is able to correct for the presence of a relatively high number of truly associated, correlated SNPs, a scenario which is expected to arise when analysing highly polygenic traits. Implementation. This method is implemented in the METACARPA software (META-analysis in C++ Accounting for Relatedness using arbitrary Precision Arithmetic). Binary and sources are freely available (https://github.com/wtsi-team144/metacarpa, http://www.sanger.ac.uk/science/tools/metacarpa).
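
The core of the correction can be sketched for the simplest case of two studies. The example below is a minimal illustration and not the METACARPA code: it assumes Digby's approximation takes the form (OR^0.75 - 1)/(OR^0.75 + 1) for a 2 × 2 table with odds ratio OR, uses a signed two-sided conversion between P values and z-scores as one conventional choice, and computes the effect-size weights in closed form from the inverse of a 2 × 2 covariance matrix. All function names and input values are hypothetical.

```python
from math import copysign, sqrt
from statistics import NormalDist

PHI = NormalDist()


def digby_tetrachoric(a, b, c, d):
    """Approximate tetrachoric correlation of the 2x2 table [[a, b], [c, d]]."""
    t = ((a * d) / (b * c)) ** 0.75
    return (t - 1) / (t + 1)


def signed_z(p, beta):
    """Two-sided P value to z-score carrying the sign of the effect (assumed convention)."""
    return copysign(PHI.inv_cdf(1 - p / 2), beta)


def corrected_effect_meta(beta1, se1, beta2, se2, r):
    """Effect-size meta-analysis of two studies whose estimates have correlation r.

    Weights are w = Omega^-1 e / (e' Omega^-1 e), with
    Omega = [[se1^2, r*se1*se2], [r*se1*se2, se2^2]].
    """
    cov = r * se1 * se2
    # Row sums of Omega^-1, up to the common factor 1/det(Omega).
    u1 = se2 ** 2 - cov
    u2 = se1 ** 2 - cov
    w1, w2 = u1 / (u1 + u2), u2 / (u1 + u2)
    beta = w1 * beta1 + w2 * beta2
    # Variance of a weighted sum, including the covariance term.
    var = w1 ** 2 * se1 ** 2 + w2 ** 2 * se2 ** 2 + 2 * w1 * w2 * cov
    p = 2 * PHI.cdf(-abs(beta / sqrt(var)))
    return beta, sqrt(var), p


# Hypothetical 2x2 table of dichotomised statistics from two studies.
r_hat = digby_tetrachoric(30, 20, 20, 30)
uncorrected = corrected_effect_meta(0.10, 0.02, 0.12, 0.02, 0.0)
corrected = corrected_effect_meta(0.10, 0.02, 0.12, 0.02, 0.5)
print(r_hat, uncorrected, corrected)
```

With r = 0 the weights reduce to ordinary inverse-variance weights; a positive r inflates the standard error of the combined estimate, which is the mechanism by which the correction tempers false positives under sample overlap.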

Prioritization and validation

Variants were prioritized for validation by direct genotyping from the meta-analysis across cohorts, keeping only the most significant SNV with P≤5.00 × 10⁻⁸ in a 500 kbp window around any given signal. Variants that were genome-wide significant in the within-cohort meta-analysis and not within 500 kbp of the across-cohort meta-analysis signal were also considered. Replication is demonstrated in the within-cohort meta-analysis by nominal significance (two-sided P≤0.05) in the same direction in both datasets (we relaxed this to P≤0.08 for MANOLIS CoreExome for 2 variants due to small sample size). For the across-cohort meta-analysis, both within-cohort meta-analyses were required to have P≤0.05. To determine the Bonferroni-corrected genome-wide significance level, for each cohort array we used the eigenvalues of the correlation matrix of the 31 traits tested (ref. 44) to calculate the effective number of independent phenotypes. Then, the genome-wide P threshold to control FWER at 5%, using a Bonferroni correction, is 5.00 × 10⁻⁸ divided by the effective number of independent phenotypes. We selected the cohort array with the maximum number of independent phenotypes for the calculation, which gave 5.00 × 10⁻⁸/14.99=3.33 × 10⁻⁹. Prioritized variants were independently genotyped in as many of the imputed samples as possible using the Sequenom iPLEX Assay and the Sequenom MassARRAY System (Agena Bioscience) (Supplementary Methods). Sixty-two variants were directly genotyped in a maximum of 2,778 samples. Concordances of the major and minor alleles were calculated separately. The minor allele concordance and the positive predictive value (PPV), which is the fraction of true positives for the minor allele calls, were used to assess the imputed genotype quality. Phenotypes were prepared again only for the samples with directly typed genotypes, and the association and meta-analysis were repeated.
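
The threshold arithmetic and the window-based choice of lead variants can be sketched as follows. This is an illustrative sketch under stated assumptions: the window is interpreted as a maximum distance of 500 kbp between a candidate and an already selected lead on the same chromosome, the selection is greedy by ascending P value, and the variant tuples are toy values rather than study results.

```python
def corrected_threshold(n_effective_traits, base=5e-8):
    """Genome-wide threshold divided by the effective number of independent traits."""
    return base / n_effective_traits


def lead_variants(hits, window=500_000, threshold=5e-8):
    """Pick the most significant variant per window from (chrom, pos, p) tuples."""
    significant = sorted((h for h in hits if h[2] <= threshold), key=lambda h: h[2])
    leads = []
    for chrom, pos, p in significant:
        # Keep a variant only if no stronger lead lies within the window.
        if all(c != chrom or abs(pos - q) > window for c, q, _ in leads):
            leads.append((chrom, pos, p))
    return leads


toy_hits = [
    ("1", 1_000_000, 1e-10),
    ("1", 1_200_000, 1e-9),
    ("1", 2_000_000, 4e-8),
    ("2", 1_000_000, 1e-6),
]
print(corrected_threshold(14.99))
print(lead_variants(toy_hits))
```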
Concordance and PPV were calculated by comparing the imputed genotypes with the directly typed genotypes. The proportion of variants that had both concordance and PPV>90% were: MANOLIS CoreExome 54.8%; MANOLIS OmniExome 53.2%; Pomak CoreExome 50%; Pomak OmniExome 54.8%. All variants reported here pass validation (Supplementary Table 5). For the weight signal with rs112037309 we validated a proxy, rs17262443, which has r²=1 with rs112037309 (rs17262443, P=3.69 × 10⁻⁸). Conditional analysis confirmed these represent the same signal.
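
A possible way to compute the two validation metrics from paired genotypes is sketched below. PPV follows the definition given in the text (the fraction of true positives among imputed minor allele calls); the minor allele concordance used here, true positives over all directly typed minor alleles, is an assumption made for illustration, as are the function name, the 0/1/2 minor allele coding and the toy genotypes.

```python
def minor_allele_metrics(typed, imputed):
    """Return (concordance, ppv) from per-sample minor allele counts coded 0/1/2."""
    tp = sum(min(t, i) for t, i in zip(typed, imputed))      # confirmed minor alleles
    fp = sum(max(i - t, 0) for t, i in zip(typed, imputed))  # imputed but not typed
    fn = sum(max(t - i, 0) for t, i in zip(typed, imputed))  # typed but not imputed
    concordance = tp / (tp + fn)
    ppv = tp / (tp + fp)
    return concordance, ppv


# Toy genotypes for five samples at one variant.
typed_calls = [0, 1, 2, 1, 0]
imputed_calls = [0, 1, 1, 1, 1]
print(minor_allele_metrics(typed_calls, imputed_calls))
```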

Data availability

The following HELIC genotype and WGS datasets have been deposited to the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/home): EGAD00010000518; EGAD00010000522; EGAD00010000610; EGAD00001001636. We have also contributed the 249 HELIC MANOLIS whole-genome sequences to the Haplotype Reference Consortium (http://www.haplotype-reference-consortium.org).

Additional information

How to cite this article: Southam, L. et al. Whole genome sequencing and imputation in isolated populations identify genetic associations with medically-relevant complex traits. Nat. Commun. 8, 15606 doi: 10.1038/ncomms15606 (2017). Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

43 in total

1. Meta-analysis of correlated traits via summary statistics from GWASs with an application in hypertension.

Authors: Xiaofeng Zhu; Tao Feng; Bamidele O Tayo; Jingjing Liang; J Hunter Young; Nora Franceschini; Jennifer A Smith; Lisa R Yanek; Yan V Sun; Todd L Edwards; Wei Chen; Mike Nalls; Ervin Fox; Michele Sale; Erwin Bottinger; Charles Rotimi; Yongmei Liu; Barbara McKnight; Kiang Liu; Donna K Arnett; Aravinda Chakravati; Richard S Cooper; Susan Redline
Journal: Am J Hum Genet Date: 2014-12-11 Impact factor: 11.025

2. Large-scale whole-genome sequencing of the Icelandic population.

Authors: Daniel F Gudbjartsson; Hannes Helgason; Sigurjon A Gudjonsson; Florian Zink; Asmundur Oddson; Arnaldur Gylfason; Soren Besenbacher; Gisli Magnusson; Bjarni V Halldorsson; Eirikur Hjartarson; Gunnar Th Sigurdsson; Simon N Stacey; Michael L Frigge; Hilma Holm; Jona Saemundsdottir; Hafdis Th Helgadottir; Hrefna Johannsdottir; Gunnlaugur Sigfusson; Gudmundur Thorgeirsson; Jon Th Sverrisson; Solveig Gretarsdottir; G Bragi Walters; Thorunn Rafnar; Bjarni Thjodleifsson; Einar S Bjornsson; Sigurdur Olafsson; Hildur Thorarinsdottir; Thora Steingrimsdottir; Thora S Gudmundsdottir; Asgeir Theodors; Jon G Jonasson; Asgeir Sigurdsson; Gyda Bjornsdottir; Jon J Jonsson; Olafur Thorarensen; Petur Ludvigsson; Hakon Gudbjartsson; Gudmundur I Eyjolfsson; Olof Sigurdardottir; Isleifur Olafsson; David O Arnar; Olafur Th Magnusson; Augustine Kong; Gisli Masson; Unnur Thorsteinsdottir; Agnar Helgason; Patrick Sulem; Kari Stefansson
Journal: Nat Genet Date: 2015-03-25 Impact factor: 38.330

3. Cloning and functional characterization of DSCAML1, a novel DSCAM-like cell adhesion molecule that mediates homophilic intercellular adhesion.

Authors: K L Agarwala; S Ganesh; Y Tsutsumi; T Suzuki; K Amano; K Yamakawa
Journal: Biochem Biophys Res Commun Date: 2001-07-20 Impact factor: 3.575

4. Genetic characterization of Greek population isolates reveals strong genetic drift at missense and trait-associated variants.

Authors: Kalliope Panoutsopoulou; Konstantinos Hatzikotoulas; Dionysia Kiara Xifara; Vincenza Colonna; Aliki-Eleni Farmaki; Graham R S Ritchie; Lorraine Southam; Arthur Gilly; Ioanna Tachmazidou; Segun Fatumo; Angela Matchan; Nigel W Rayner; Ioanna Ntalla; Massimo Mezzavilla; Yuan Chen; Chrysoula Kiagiadaki; Eleni Zengini; Vasiliki Mamakou; Antonis Athanasiadis; Margarita Giannakopoulou; Vassiliki-Eirini Kariakli; Rebecca N Nsubuga; Alex Karabarinde; Manjinder Sandhu; Gil McVean; Chris Tyler-Smith; Emmanouil Tsafantakis; Maria Karaleftheri; Yali Xue; George Dedoussis; Eleftheria Zeggini
Journal: Nat Commun Date: 2014-11-06 Impact factor: 14.919

5. Genetic determinants of haemolysis in sickle cell anaemia.

Authors: Jacqueline N Milton; Helen Rooks; Emma Drasar; Elizabeth L McCabe; Clinton T Baldwin; Efi Melista; Victor R Gordeuk; Mehdi Nouraie; Gregory R Kato; Gregory J Kato; Caterina Minniti; James Taylor; Andrew Campbell; Lori Luchtman-Jones; Sohail Rana; Oswaldo Castro; Yingze Zhang; Swee Lay Thein; Paola Sebastiani; Mark T Gladwin; Martin H Steinberg
Journal: Br J Haematol Date: 2013-02-14 Impact factor: 6.998

6. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

7. Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels.

Authors: Fabrice Danjou; Magdalena Zoledziewska; Carlo Sidore; Maristella Steri; Fabio Busonero; Andrea Maschio; Antonella Mulas; Lucia Perseu; Susanna Barella; Eleonora Porcu; Giorgio Pistis; Maristella Pitzalis; Mauro Pala; Stephan Menzel; Sarah Metrustry; Timothy D Spector; Lidia Leoni; Andrea Angius; Manuela Uda; Paolo Moi; Swee Lay Thein; Renzo Galanello; Gonçalo R Abecasis; David Schlessinger; Serena Sanna; Francesco Cucca
Journal: Nat Genet Date: 2015-09-14 Impact factor: 38.330

8. A rare functional cardioprotective APOC3 variant has risen in frequency in distinct population isolates.

Authors: Ioanna Tachmazidou; George Dedoussis; Lorraine Southam; Aliki-Eleni Farmaki; Graham R S Ritchie; Dionysia K Xifara; Angela Matchan; Konstantinos Hatzikotoulas; Nigel W Rayner; Yuan Chen; Toni I Pollin; Jeffrey R O'Connell; Laura M Yerges-Armstrong; Chrysoula Kiagiadaki; Kalliope Panoutsopoulou; Jeremy Schwartzentruber; Loukas Moutsianas; Emmanouil Tsafantakis; Chris Tyler-Smith; Gil McVean; Yali Xue; Eleftheria Zeggini
Journal: Nat Commun Date: 2013 Impact factor: 14.919

9. The UK10K project identifies rare variants in health and disease.

Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal: Nature Date: 2015-09-14 Impact factor: 49.962

10. Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers.

Authors: Carlo Sidore; Fabio Busonero; Andrea Maschio; Eleonora Porcu; Silvia Naitza; Magdalena Zoledziewska; Antonella Mulas; Giorgio Pistis; Maristella Steri; Fabrice Danjou; Alan Kwong; Vicente Diego Ortega Del Vecchyo; Charleston W K Chiang; Jennifer Bragg-Gresham; Maristella Pitzalis; Ramaiah Nagaraja; Brendan Tarrier; Christine Brennan; Sergio Uzzau; Christian Fuchsberger; Rossano Atzeni; Frederic Reinier; Riccardo Berutti; Jie Huang; Nicholas J Timpson; Daniela Toniolo; Paolo Gasparini; Giovanni Malerba; George Dedoussis; Eleftheria Zeggini; Nicole Soranzo; Chris Jones; Robert Lyons; Andrea Angius; Hyun M Kang; John Novembre; Serena Sanna; David Schlessinger; Francesco Cucca; Gonçalo R Abecasis
Journal: Nat Genet Date: 2015-09-14 Impact factor: 38.330
