Enrichment of low-frequency functional variants revealed by whole-genome sequencing of multiple isolated European populations.

Yali Xue¹, Massimo Mezzavilla¹,², Marc Haber¹, Shane McCarthy¹, Yuan Chen¹, Vagheesh Narasimhan¹, Arthur Gilly¹, Qasim Ayub¹, Vincenza Colonna¹,³, Lorraine Southam¹,⁴, Christopher Finan¹, Andrea Massaia¹,⁵, Himanshu Chheda⁶, Priit Palta⁶,⁷, Graham Ritchie¹,⁸,⁹, Jennifer Asimit¹, George Dedoussis¹⁰, Paolo Gasparini¹¹, Aarno Palotie¹,⁶,¹²,¹³,¹⁴,¹⁵,¹⁶, Samuli Ripatti¹,⁶,¹⁷, Nicole Soranzo¹,¹⁸, Daniela Toniolo¹⁹, James F Wilson⁹,²⁰, Richard Durbin¹, Chris Tyler-Smith¹, Eleftheria Zeggini¹.

Abstract

The genetic features of isolated populations can boost power in complex-trait association studies, and an in-depth understanding of how their genetic variation has been shaped by their demographic history can help leverage these advantageous characteristics. Here, we perform a comprehensive investigation using 3,059 newly generated low-depth whole-genome sequences from eight European isolates and two matched general populations, together with published data from the 1000 Genomes Project and UK10K. Sequencing data give deeper and richer insights into population demography and genetic characteristics than genotype-chip data, distinguishing related populations more effectively and allowing their functional variants to be studied more fully. We demonstrate relaxation of purifying selection in the isolates, leading to enrichment of rare and low-frequency functional variants, using novel statistics, DVxy and SVxy. We also develop an isolation-index (Isx) that predicts the overall level of such key genetic characteristics and can thus help guide population choice in future complex-trait association studies.

Year: 2017 PMID: 28643794 PMCID: PMC5490002 DOI: 10.1038/ncomms15927

Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919

Population variation in disease susceptibility has been shaped by environment, demography and evolutionary history. Isolated populations (isolates) have generally experienced bottlenecks and strong genetic drift, so by chance some deleterious rare variants have increased in frequency while some neutral rare variation has been lost, both helpful characteristics for the discovery of novel rare variant signals underpinning complex traits (refs 1–3). Studies to date have focused on individual isolates and have identified several disease-associated signals (refs 4–12). However, isolates differ in the time when they became isolated, their initial population size, the level of gene flow from outside and other historical demographic factors, and consequently also differ in their power for association studies (ref. 2). We thus generate and analyse low-depth (4x–10x) whole-genome sequences (WGS) from eight cohorts drawn from isolated European populations and compare each isolate with the closest non-isolated (general) population, for which we also generate or access WGS data. We then investigate empirically how these historical differences influence the population-genetic properties of isolates, and frame these insights in terms of their consequences for study design in complex trait association studies.

Results

Samples, sequencing and QC

The data set includes newly generated low-depth (4x–10x) WGS from eight cohorts drawn from isolated European populations: one each from Kuusamo in Finland (FIK) and Crete in Greece (GRM; ref. 13), four from Friuli-Venezia Giulia in Italy (IF1, IF2, IF3 and IF4 (ref. 14)), and one each from Val Borbera in Italy (IVB; ref. 15) and the Orkney Islands in the UK (UKO; ref. 16); and the closest non-isolated (general) populations: Finland (FIG; ref. 9), Greece (GRG), together with publicly available data for Italy (ITG; ref. 17) and UK (UKG; ref. 18) (Fig. 1a and Supplementary Table 1). We generated a superset of variants called in these cohorts and all 26 population samples in the 1000 Genomes Project Phase 3 (ref. 17), and performed multi-sample genotype calling across all 9,375 samples (3,059 from the current study, 2,535 from the 1000 Genomes Project Phase 3 release and 3,781 from UK10K). Both individual population and amalgamated genotype call data, which have greater than 99% concordance with genotyping data (Supplementary Table 2), are available to the scientific community (Data availability).

Figure 1

General characteristics and demographic history of isolated and matched general populations.

(a) Geographical locations of samples. The base map was plotted in R using the mapdata package and circles were added using Photoshop. (b) PCA using common variants. (c) PCA using low-frequency variants. (d) Sharing of rare variants within and between populations. Upper left triangle: f2 variants; lower right triangle f3–f10 variants. (e) Effective population size (Ne) inferred from IBDNe for UKO and UKG during the past nine KY. (f) The lowest Ne inferred by IBDNe for all populations for the past three KY, plotted as a function of the time at which it occurred.

General description of the variants in the isolates

We identified approximately 12.2 million variants with minor allele frequency (MAF) ≤2% (rare), 5.5 million with MAF >2–≤5% (low-frequency) and 8.3 million variants with MAF >5% (common) across the ten populations newly sequenced here (eight isolates, GRG and FIG). Of these, 10.5, 0.7 and 0.3%, respectively, are novel (Table 1 and Supplementary Table 3). As expected, most of the isolates have lower numbers of variant sites per genome than their closest general population (Supplementary Fig. 1, Supplementary Table 5). We find ∼188,000–∼513,000 variants that are common with MAF >5.6% in each isolate but with MAF ≤1.4% in its closest general population (Table 1); ∼30,000–∼122,000 of these per isolate have frequency ≤1.4% in all the general samples studied, among which ∼150–∼700 in coding regions and ∼500–∼2,800 genome-wide are deleterious (Supplementary Table 4). These common and low-frequency variants are thus useful markers for whole-genome association studies in these populations and some of them (if absent from the general population) could potentially lead to novel association signals. They include known examples such as rs76353203 (R19X) in APOC3 in GRM, which is associated with high-density lipoprotein and triglyceride levels (ref. 6).

Table 1

Summary of variants discovered in this study.

POP	n	Average depth	MAF≤2% total	MAF≤2% novel (%)	MAF>2–≤5% total	MAF>2–≤5% novel (%)	MAF>5% total	MAF>5% novel (%)	Novel common SNPs in isolate*	Novel common SNPs in isolate†
FIK	377	4x	4,066,373	10.90	1,553,076	1.20	6,025,077	0.70	190,527	70,579
FIG	1,564	6x	6,548,833	11.80	1,540,915	0.80	6,053,704	0.70	na	na
GRM	249	4x	5,129,513	7.20	1,447,981	1.10	6,111,923	0.80	513,272	49,884
GRG‡	99	10–30x	3,757,110	na	1,321,955	na	5,842,537	na	na	na
IF1	60	4–10x	1,456,881	1.30	1,420,929	1.30	5,890,714	0.80	320,191	119,157
IF2	45	4–10x	1,063,098	1.30	1,554,145	1.00	6,001,568	0.80	273,694	94,496
IF3	47	4–10x	961,059	1.30	1,455,284	1.10	6,068,304	0.80	299,603	107,281
IF4	36	4–10x	1,030,673	1.30	1,124,789	1.10	6,001,625	0.80	308,356	122,254
IVB	222	6x	4,857,767	1.60	1,396,799	0.80	6,112,476	0.80	188,972	30,284
UKO	397	4x	5,963,416	11.70	1,471,782	0.80	6,047,383	0.80	193,300	36,512
Total	3,096		12,218,797	10.50	5,503,179	0.70	8,301,524	0.30

‘Novel’ variants are those not found in 1000 Genomes Project Phase 3 or UK10K project.

*Variants that are common (minor allele frequency, MAF≥5.6%, alternative allele count ≥4) in an isolated population but not common (MAF<1.4%, alternative allele count ≤1) in its closest general population.

†Variants that are common (MAF≥5.6%, alternative allele count ≥4) in an isolated population but not (MAF<1.4%, alternative allele count ≤1) in any of the general populations.

‡Different variant calling procedure in this population.
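The frequency and allele-count thresholds in the footnotes above line up if one assumes they were applied to 36 diploid individuals (72 chromosomes), the minimum sample size used elsewhere in the paper. That link is an inference from the numbers, not something the footnotes state. A minimal Python sketch of the arithmetic and of the resulting filter, with hypothetical function names:

```python
def maf_from_count(alt_count: int, n_individuals: int) -> float:
    """Allele frequency implied by an allele count in n diploid individuals."""
    return alt_count / (2 * n_individuals)


def is_common_in_isolate_only(ac_isolate: int, ac_general: int,
                              min_isolate_ac: int = 4,
                              max_general_ac: int = 1) -> bool:
    """Apply the footnote thresholds: count >= 4 in the isolate, <= 1 in the general population.

    Assumes both counts come from samples of equal size (36 individuals here).
    """
    return ac_isolate >= min_isolate_ac and ac_general <= max_general_ac


if __name__ == "__main__":
    # 4 of 72 chromosomes and 1 of 72 chromosomes, expressed as percentages
    print(round(100 * maf_from_count(4, 36), 1))  # 5.6
    print(round(100 * maf_from_count(1, 36), 1))  # 1.4
    print(is_common_in_isolate_only(ac_isolate=6, ac_general=0))  # True
```

Working with allele counts rather than rounded percentages avoids ambiguity at the boundaries (for example, whether 1/72 counts as "<1.4%" or "≤1.4%").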

Population-genetic analyses in the isolates

Previous population-genetic studies of isolates have, with some exceptions (refs 11,19), been based on common variants found on genotyping arrays, and have illustrated general characteristics such as low genetic diversity and longer shared haplotypes (refs 9,13,14,15,19,20). Rare variants discovered from sequencing are on average more recent in origin than common variants (ref. 21) and therefore more powerful for distinguishing closely related populations and more informative about recent demographic history. We find that isolates are, as expected, genetically close to their matched general population in principal component analyses (PCA), ADMIXTURE (ref. 22) and TreeMix (ref. 23) using common variants (Fig. 1b, Supplementary Figs 2–5 and Supplementary Table 6), but PCA using rare and low-frequency variants, as found previously (ref. 24), distinguishes them more clearly from the general population and also from other isolates, particularly among the Italian samples (Fig. 1c, Supplementary Fig. 2). The majority of sharing of variants present just twice across all samples of 36 individuals from each population (f2 variants; ref. 21) takes place within the same population, and the isolates generally share more with their closest general population than with other populations. This latter trend, however, is not apparent for IF1–IF4, which show little sharing with any other population, pointing to a greater level of isolation and lower level of gene flow with their general population (Fig. 1d, upper triangle and Supplementary Fig. 7), which is confirmed by f3-statistics (ref. 25) comparing with a worldwide population panel of HGDP-CEPH samples using common SNPs (Supplementary Fig. 6). f3–f10 variant sharing demonstrates sharing by ITG and IVB with both Greek and UK populations (Fig. 1d, lower triangle and Supplementary Fig. 7), potentially indicative of their more ancient heritage.

Population demographic history

All populations studied here, both isolates and general, appear to have shared a comparable effective population size (Ne) history before 20 thousand years ago (KYA) based on the multiple sequentially Markovian coalescent method (ref. 26) (Supplementary Fig. 9). The isolates diverged from their general populations within the last ∼5,000 years based on LD estimations (ref. 27) (Supplementary Table 7 and Supplementary Fig. 8) and yet had sharp decreases in their population sizes in more recent times as estimated using inferred long segments of identity by descent (IBD; ref. 28) (Fig. 1e,f and Supplementary Fig. 10). Different isolates also split from their respective general populations at different times. For example, IF1–IF4 split from ITG ∼4–5 KYA, while most other isolates split from their general populations within the last ∼1,000 years (Supplementary Table 7). The different demographic histories of different isolates should lead to different genetic characteristics. To summarize these features in a single quantitative measure that can be calculated from genotype data, as well as sequence data, we developed an isolation index (Isx) which combines information on the divergence time from the general population (Tdg), Ne and migration rate (M), such that early-divergence-time isolates with small Ne and low M have a high Isx value (Fig. 2a and Supplementary Fig. 11). The different isolates show different Isx values: IF1, IF2, IF3 and IF4 have the highest, while IVB has the lowest (Supplementary Table 8). Isx values are highly correlated with other population-genetic characteristics (for example, Fig. 2b,c, Supplementary Table 11), such as genome-wide pairwise FST between isolates and their matching general population (reflecting the genetic drift of the isolates) (Supplementary Fig. 12), the total length and number of runs of homozygosity (ROH) (Supplementary Fig. 13), inbreeding coefficient (F) (Supplementary Fig. 14) and length of LD (Supplementary Figs 15 and 16 and Supplementary Tables 9 and 10). All these characteristics are correlated, but the pairwise correlation coefficients show that Isx is a slightly better overall predictor of the other measures than any single existing measure (Fig. 2c, Supplementary Fig. 17 and Supplementary Table 11); moreover, it is potentially more robust to confounding factors as it is calculated from three demographic parameters, while the others are all based on single measurements.

Figure 2

Isolation index (Isx) and its correlation with other genetic measures.

(a) Information summarized in Isx. (b) Example of the correlation between Isx and other statistics, here DVxy-coding. (c) Summary of the correlations between Isx and other population-genetic statistics. All the correlation coefficients are high and statistically significant.

Purifying selection analyses

Several lines of evidence suggest relaxed purifying selection in the isolates due to their reduced Ne, although as expected we do not detect substantially increased genetic load per genome using the Rxy statistic (ref. 29) based on all of the variants in the genomes (Fig. 3a and Supplementary Table 12). First, we see different levels of enrichment of low-frequency functional variants in isolates (Fig. 3b,c, Supplementary Tables 13 and 14, Supplementary Fig. 18a) quantified by a new statistic, DVxy-coding, developed here (DV: drifted variants). DVxy-coding measures the ratio of functional coding variants (missense plus loss-of-function (LoF)) in isolates compared to the closest general population (and vice versa), adjusted for the corresponding ratios of intergenic variants in order to correct for the effect of genetic drift. We applied this only to a subclass of DVs, defined as low-frequency (2–5%, the best choice according to the sample size we have) in any isolate, yet at least three-fold higher than in the closest general population (and vice versa). We find that DVxy-coding is >1 in all isolates and <1 in all general populations (Fig. 3c, Supplementary Fig. 18a and Supplementary Table 13). We also calculated a similar DVxy-wg statistic by stratifying whole-genome variants according to their combined annotation dependent depletion (CADD) score (0–5, neutral variants; 5–10, mildly deleterious; 10–20, deleterious; and >20, highly deleterious; these cut-off choices balance the number of variants in each bin to allow us comparable statistical power among all bins, although the conclusions are robust to the particular cut-off values chosen and different bins (Supplementary Figs 18b and 19)). The DVxy-wg values are differentiated for variants with CADD score of 10–20 and significantly so (assessed using the jack-knife bootstrap method) for ones with CADD scores >20, with DVxy-wg values >1 in all isolates and <1 in all general populations (Fig. 3b, Supplementary Fig. 18b and Supplementary Table 14). This demonstrates enrichment of low-frequency functional variants, both coding and genome-wide with CADD score >10, in the isolated populations. Moreover, both DVxy-coding and DVxy-wg values are correlated with Isx, suggesting that different isolation characteristics lead to different levels of enrichment of functional variants.

Figure 3

Purifying selection in the isolates and general populations.

(a) Rxy-missense statistic in each isolate, showing no evidence for increased genetic load in the isolates. The mean and s.d. for each Rxy value from 100 bootstraps are shown. (b) DVxy-wg (DVxy-whole genome) statistic in isolates and general populations, stratified by CADD score, showing enrichment of highly functional low-frequency variants. (c) DVxy-coding statistic in isolates and general populations, showing enrichment of low-frequency missense variants in isolates. (d) SVxy-missense statistic in each isolate, showing relaxation of purifying selection in isolates in singletons. The s.e.'s for both DVxy and SVxy were calculated by randomly sampling data from 20 chromosomes 100 times. All of these analyses are based on the minimum-sample-size data set (36 individuals from each population).

We also investigated the relaxation of purifying selection by assessing functional (missense) singleton variants (SV) pooled for all of the genes that have at least one singleton missense or synonymous variant in a pair of populations (one isolate and its general population), correcting with pooled synonymous variants (SVxy statistic). We find a substantial deviation from 1 for functional singletons in all of the isolates (Fig. 3d and Supplementary Table 15), with SVxy values positively correlating with Isx (Fig. 2c and Supplementary Fig. 20). We also find that the proportion of relaxed essential genes (ref. 30) with SVxy >1 in isolates is significantly higher than in the general population (Supplementary Table 15). Such rare and low-frequency drifted functional variants, measured by both SVxy and DVxy, are particularly relevant for boosting the power of association studies (ref. 6).

Positive selection analyses

We do not find convincing evidence for positive selection in any isolate using deltaDAF (ref. 31), PCAdapt (ref. 32) or singleton density score (SDS; ref. 33), although we do identify some highly differentiated variants (Supplementary Fig. 21 and Supplementary Tables 16 and 17), including in the protein-coding genes ALK, SPNS2, SLC39A11 and ACSS2, which can nevertheless be accounted for by drift. Interestingly, we also find six highly differentiated variants shared between different isolates from Italy, IF2, IF3 and IF4, but interpret them as likely to result from drift or positive selection for the ancestral allele in the ITG (Supplementary Table 17). We find that the SDS method has little power in our samples because of their small size, and failed to detect selection even at the lactose tolerance SNP in the UKO, a known strong signal of recent selection (Supplementary Fig. 22).

Discussion

Isolated populations have special characteristics that can be leveraged to increase the power of association studies, as several previous studies have shown (refs 19,34). Nevertheless, only a small proportion of functional variants have increased in frequency in any one isolate, so multiple isolates must be investigated to reveal the full diversity of associated variants. Here, we probed an extended allele frequency spectrum of variants potentially underpinning human complex disease through the analysis of WGS data in multiple isolates matched to nearby non-isolated populations, capturing common, low-frequency and rare variants. We quantified different levels of isolation resulting from different demographic histories and have demonstrated that the Isx statistic, calculated even from SNP-chip data, reliably captures these relevant features. This study provides a systematic evaluation of the genetic characteristics of multiple European isolates and for the first time empirically demonstrates enrichment of rare functional variants across multiple isolates. With the advent of large-scale whole-genome sequencing, studies in isolates are poised to continue as major contributors to our understanding of complex disease etiology.

Methods

Data set and variant calling

The data set includes 3,059 whole-genome low-depth sequences generated at The Wellcome Trust Sanger Institute using the Illumina Genome Analyzer II and Illumina HiSeq 2000 platforms, as well as 100 high-depth sequences from the Illumina HiSeq X Ten (Fig. 1a and Supplementary Table 1). Informed consent was obtained from all subjects and the study was approved by the HMDMC (Human Materials and Data Management Committee) of the Wellcome Trust Sanger Institute. The multi-sample genotype calling across all of the low-coverage sequencing data from the current study, as well as 2,535 from the 1000 Genomes Project Phase 3 release, and 3,781 from UK10K (a total of 9,375) was performed with the defined site selection criteria (Supplementary Note). Genotype likelihoods were calculated with samtools/bcftools (0.2.0-rc9) and then genotypes were called and phased using Beagle v4 (r1274) (ref. 35). We assessed the performance of the genotype calling from the low coverage data using the available genotype chip data for a subset of the cohorts consisting of 4,665 individuals, and calculated the discordance rates on chromosome 20 separately for the categories REF-REF, REF-ALT and ALT-ALT. The sample sizes are very different across these collections, and we used three different standard-sized subsets of the samples for different analyses: (1) the whole data set; (2) the sample-size-matched data set, obtained either by randomly selecting samples from the general population to match the isolated population (for example, we randomly select 377 from FIG to match FIK), or by randomly selecting a subset of the isolated population to match the general population (for example, we randomly select 108 IVB to match the general population ITG); (3) the minimum-sample-size data set of 36 individuals per population. By doing this, we maximize the use of the data for different analyses, and we specify which data set is used for each analysis. The sequencing depth is also different across different populations, within a 2.5-fold range (apart from GRG, in which variants were called differently, details in Supplementary Notes), and we allowed for these differences when interpreting the results.
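The per-category discordance check described above can be illustrated with a short Python sketch. It assumes genotypes are coded as alternative-allele counts (0, 1, 2) and that categories are defined by the chip genotype; the text does not say which of the two genotype sources defines the category, so that choice and the function name are assumptions made for illustration.

```python
from collections import defaultdict

# Genotype coded as the number of alternative alleles
CATEGORY = {0: "REF-REF", 1: "REF-ALT", 2: "ALT-ALT"}


def discordance_by_category(pairs):
    """Return the fraction of discordant calls per chip-genotype category.

    pairs: iterable of (chip_genotype, called_genotype) tuples, each 0, 1 or 2.
    Categories with no observations are omitted from the result.
    """
    totals = defaultdict(int)
    mismatches = defaultdict(int)
    for chip, called in pairs:
        label = CATEGORY[chip]
        totals[label] += 1
        if called != chip:
            mismatches[label] += 1
    return {label: mismatches[label] / totals[label] for label in totals}


if __name__ == "__main__":
    toy_pairs = [(0, 0), (0, 0), (0, 1), (0, 0), (1, 1), (1, 0), (2, 2)]
    print(discordance_by_category(toy_pairs))
```

Reporting the three categories separately matters for low-depth data because heterozygous sites are typically the hardest to call, so an overall rate dominated by REF-REF sites can hide most of the errors.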

Variant counts

We first re-annotated all variants using the Variant Effect Predictor annotation from Ensembl 76 with the ‘--pick’ option, which gives one annotation per variant. We then performed variant counting at both the population and individual level, stratifying by functional categories and frequency bins. These counts were either plotted in figures or summarized as median values in tables. We carried out these analyses using both the sample-size-matched data set and the minimum-sample-size data set.

Population-genetic analyses

We used the whole data set for the analyses in this section, unless otherwise specified. PCAs were performed separately with common variants or rare variants using EIGENSTRAT v.501 (ref. 36). Shared ancestry between the populations studied here was evaluated using ADMIXTURE v1.22 (ref. 22). The relationships between the populations studied here, combined with worldwide populations from the HGDP-CEPH panel (ref. 37), were also examined using ancestry graph analyses implemented in TreeMix v.1.12 (ref. 23). We also used a formal test of f3-statistics (ref. 25) to investigate population mixture in the history of the populations studied here, as well as worldwide populations from the HGDP-CEPH panel. Rare f2 variants (with only two copies of the alternative allele in the minimum-sample-size data set) and moderately rare f3–f10 variants (3–10 copies of the alternative allele in the same data set) are particularly informative for investigating recent human history (ref. 21). We investigated the sharing pattern of these two types of variant by summing all f2 variants or any random two alleles of the f3–f10 variants shared by pairs of individuals. We plotted the results as a heat map using the image function from the base R package (https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/image.html). Variants were aggregated by pair of individuals using the ‘count’ function of the plyr package, then arranged in matrix form and colourized using ‘colorRampPalette’ from the colorspace package (https://cran.r-project.org/web/packages/colorspace/index.html). ROH, inbreeding coefficient (F) as well as the length of LD-blocks were calculated in PLINK, and finally genome-wide FST values between isolates and their general populations were calculated with the software 4P (ref. 38) using the minimum-sample-size data set.
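The f2 sharing count described above can be sketched in a few lines of Python. This is an illustration only, not the authors' R/plyr pipeline: the input layout (a mapping from variant ID to the list of carrier individuals, one entry per allele copy) and the function names are assumptions.

```python
from collections import Counter


def f2_pair_counts(carriers_by_variant):
    """Count f2 variants shared by each pair of individuals.

    carriers_by_variant maps a variant ID to the individuals carrying the
    alternative allele, with one list entry per allele copy.
    """
    counts = Counter()
    for carriers in carriers_by_variant.values():
        if len(carriers) != 2:
            continue  # not an f2 variant (exactly two copies required)
        first, second = sorted(carriers)
        if first == second:
            continue  # both copies sit in one homozygous individual
        counts[(first, second)] += 1
    return counts


def population_sharing(pair_counts, population_of):
    """Aggregate individual-pair counts into a population-by-population tally."""
    matrix = Counter()
    for (first, second), n in pair_counts.items():
        key = tuple(sorted((population_of[first], population_of[second])))
        matrix[key] += n
    return matrix


if __name__ == "__main__":
    toy = {"v1": ["a", "b"], "v2": ["b", "a"], "v3": ["a", "c"],
           "v4": ["a", "a"], "v5": ["a", "b", "c"]}
    pops = {"a": "UKO", "b": "UKO", "c": "UKG"}
    print(population_sharing(f2_pair_counts(toy), pops))
```

The population-level tally is the kind of matrix that would then be drawn as a heat map; any normalisation by the number of individual pairs per population pair is left out here.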

Demographic inferences

LD-based (refs 39–41) demographic inference was performed in the NeON R package (ref. 27) using the minimum-sample-size data set; the median and confidence interval were estimated using the 50th, 5th and 95th percentiles of the distribution of long-term Ne in each time interval. We used the multiple sequentially Markovian coalescent method (ref. 26) to infer demographic changes before 20,000 years ago using four individual sequences from each population. In order to account for some loss of heterozygous sites in the low-depth data, we used a slow mutation rate of 0.8 × 10⁻⁸ mutations per nucleotide per generation and a longer generation time of 33 years. We then estimated more recent demographic changes (from the present to ∼9,000 years ago) using IBDNe (ref. 28) with the minimum-sample-size data set. We used IBDseq (ref. 42) to detect IBD segments in sequence data from chromosome 2 in all populations. We then used IBDNe with the default parameters and a minimum IBD segment length of 2 centiMorgan (cM) units. We assumed a generation time of 29 years.

Isolation index

In order to quantify the different isolation levels of different isolates, we developed an index that combines three demographic parameters: (a) Tdg, (b) Ne and (c) the level of private isolate ancestry (M). We call this estimate the Isolation index (Isx). It is defined as: [equation not reproduced in this text]. Both Tdg and Ne were inferred from the LD-based method using the NeON R package (ref. 27). M is difficult to estimate directly from SNP genotype data, so here we estimated the difference of shared ancestral components between an isolate and its general population from ADMIXTURE analysis. We ran ADMIXTURE with only one isolate and its closest general population using K=2. We then estimated the difference in the means of ancestry between the isolate and its general population. The M parameter was defined as Delta Ancestry.

Rxy analysis

Rxy statistics (ref. 29) between each pair of populations (an isolate and its closest general population) for different functional categories were calculated using the matched-sample-size data for missense and LoF variants, including stop gain, splice donor and acceptor variants, using synonymous variants as controls (we did not use intergenic variants as control because of the ascertainment in the ITG which has high-depth exome sequences and low depth for the rest of the genome). We also calculated Rxy statistics for variants with CADD scores (ref. 43) greater than 10 and 20, using variants with CADD scores less than 5 as controls. The mean and s.d. for each Rxy value were obtained from 100 bootstraps.
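The text above does not spell out the Rxy formula. The Python sketch below uses a form in which the statistic is commonly written: L(x not y) = Σ f_x(1 − f_y) summed over sites in a class, with the ratio L(x not y)/L(y not x) for the test class divided by the same ratio for the control class. Treat both this form and the normalisation by the control class as assumptions about how it was applied here; the function names are hypothetical.

```python
def l_x_not_y(freqs_x, freqs_y):
    """Sum over sites of f_x * (1 - f_y): derived alleles seen in x but not in y."""
    return sum(fx * (1.0 - fy) for fx, fy in zip(freqs_x, freqs_y))


def rxy(test_x, test_y, control_x, control_y):
    """Rxy for a test class (e.g. missense), normalised by a control class (e.g. synonymous).

    Each argument is a list of derived allele frequencies, with matching site
    order between the x and y lists of the same class.
    """
    test_ratio = l_x_not_y(test_x, test_y) / l_x_not_y(test_y, test_x)
    control_ratio = l_x_not_y(control_x, control_y) / l_x_not_y(control_y, control_x)
    return test_ratio / control_ratio


if __name__ == "__main__":
    # Toy frequencies, not real data
    print(rxy([0.2], [0.1], [0.1], [0.1]))
```

Under this form a value near 1 means the test class has not accumulated in population x relative to y any more than the control class has, which is how the "no increased load" result in Fig. 3a would read.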

DVxy analysis

A new statistic, DVxy, was developed to quantify the enrichment of low-frequency functional variants in the isolates using both the matched-sample-size and minimum-sample-size data sets. It calculates the proportion of functional variants in each isolate compared with its general population, correcting for genetic drift at the same time. We calculated DVxy specifically for the subset of variants with DAF 2–5% in the isolate, and at least three times lower in its closest general population, or vice versa. We called these variants ‘drifted variants’ (DV). DVxy was calculated for both coding regions and whole genomes. For coding variants, we defined missense or missense plus LoF variants as functional variants. We counted the number of functional DVs and neutral (intergenic) DVs in each isolate (population x) and the corresponding general population (population y). The ratio between the fraction of DV variants from the isolated population (corrected by the count of intergenic variants) and the corresponding fraction of DV variants from its general population was defined as the DVxy statistic. If DVxy is equal to 1, there is no enrichment for the functional DVs in the isolate; less than 1 indicates depletion, and greater than 1 indicates enrichment. For the whole genome, we used different CADD score cut-offs and bins. We calculated a DV statistic by stratifying the variants according to their CADD scores (0–5, neutral variants; 5–10, mildly deleterious; 10–20, deleterious; and greater than 20, highly deleterious) for each isolate and its closest general population. We finally calculated a ratio of the fraction of DV variants (from each class) between the isolate and its general population, and vice versa. The following formula shows the DVxy-wg calculation for variants with CADD score between i and j in an isolate and its general population: [equation not reproduced in this text]. The 95% confidence interval for each calculation was obtained by randomly sampling data from 20 chromosomes 100 times.
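One reading of the verbal definition of DVxy for coding variants can be sketched in Python as follows. It is an interpretation, not the authors' formula: the inclusive 2–5% bounds, the fold test written as `daf_other * 3 <= daf_focal`, and the names `Variant`, `is_drifted` and `dvxy_coding` are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    daf_x: float   # derived allele frequency in the isolate (population x)
    daf_y: float   # derived allele frequency in the general population (y)
    category: str  # "functional" (missense/LoF) or "intergenic"


def is_drifted(daf_focal, daf_other, low=0.02, high=0.05, fold=3.0):
    """Low-frequency in the focal population and at least `fold` times lower in the other."""
    return low <= daf_focal <= high and daf_other * fold <= daf_focal


def dvxy_coding(variants):
    """(functional DVs / intergenic DVs) in x, divided by the same ratio in y."""
    counts = {(pop, cat): 0 for pop in "xy" for cat in ("functional", "intergenic")}
    for v in variants:
        if v.category not in ("functional", "intergenic"):
            continue
        if is_drifted(v.daf_x, v.daf_y):
            counts[("x", v.category)] += 1
        if is_drifted(v.daf_y, v.daf_x):
            counts[("y", v.category)] += 1
    if 0 in (counts[("x", "intergenic")], counts[("y", "intergenic")],
             counts[("y", "functional")]):
        raise ValueError("not enough drifted variants to form the ratio")
    ratio_x = counts[("x", "functional")] / counts[("x", "intergenic")]
    ratio_y = counts[("y", "functional")] / counts[("y", "intergenic")]
    return ratio_x / ratio_y
```

Dividing by the intergenic DV count in each population is what removes the part of the signal due to drift alone, so a value above 1 points to an excess of functional variants beyond what drift would produce in neutral sequence.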

SVxy analysis

We further investigated the relaxation of purifying selection in the isolated populations using SVs. Here, we also used the minimum-sample-size data set. Another new statistic, SVxy, was developed to measure the ratio of missense versus synonymous singletons per gene in each population, as well as the ratio of the sum of singletons in all genes which have at least one singleton in the pair of the populations (one isolate and one general population). We counted the number of missense singletons and synonymous singletons per gene in each population, and SVgene was calculated as: [equation not reproduced in this text]. SVgene>1 indicates relaxation of purifying selection; SVgene=1 indicates neutrality; and SVgene<1 indicates purifying selection. We then divided the gene list into essential genes (ref. 30) and non-essential genes (the rest), and calculated a statistic, G, for each population, defined as: G = (percentage of essential genes with SVgene>1)/(percentage of non-essential genes with SVgene>1). We finally calculated a statistic, SVxy, which is the ratio of SVpop of each isolate to SVpop of its general population. SVpop for each isolate and its general population was calculated using all genes which have at least one singleton in the pair of the populations and defined as SVpop = Σ(SV missense counts)/Σ(SV synonymous counts). We used the same annotation as in the variant counts. We calculated a confidence interval for each estimate using bootstrapping of 80% of the genes 100 times.
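The SVpop, SVxy and G definitions given in words above translate directly into a short Python sketch. The input layout (gene name mapped to a pair of missense and synonymous singleton counts) and the function names are assumptions; per-gene SVgene values are taken as precomputed inputs because their formula is not reproduced in the text.

```python
def sv_pop(singletons_by_gene):
    """SVpop = sum of missense singleton counts / sum of synonymous singleton counts.

    singletons_by_gene maps gene -> (missense_count, synonymous_count).
    Genes with no singletons contribute zero to both sums, so restricting to
    genes with at least one singleton does not change the result.
    """
    missense = sum(m for m, _ in singletons_by_gene.values())
    synonymous = sum(s for _, s in singletons_by_gene.values())
    return missense / synonymous


def sv_xy(isolate_singletons, general_singletons):
    """SVxy = SVpop of the isolate / SVpop of its general population."""
    return sv_pop(isolate_singletons) / sv_pop(general_singletons)


def g_statistic(sv_gene, essential_genes):
    """G = % of essential genes with SVgene > 1 / % of non-essential genes with SVgene > 1.

    sv_gene maps gene -> precomputed SVgene value (an input here).
    """
    essential = [v for g, v in sv_gene.items() if g in essential_genes]
    other = [v for g, v in sv_gene.items() if g not in essential_genes]
    frac_essential = sum(v > 1 for v in essential) / len(essential)
    frac_other = sum(v > 1 for v in other) / len(other)
    return frac_essential / frac_other
```

Pooling counts across genes before taking the ratio, rather than averaging per-gene ratios, keeps genes with very few singletons from dominating the estimate.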

Correlation analyses

We calculated pair-wise correlation coefficients between the Isx values, population-genetic measurements ROH, F, FST, and number and length of LD blocks, as well as the newly developed statistics DVxy and SVxy using the Pearson correlation in R.

Positive selection analyses

We calculated genome-wide pairwise derived allele frequency differences (deltaDAF) for each pair of populations (an isolate and its general population) as described previously (ref. 31) using the matched-sample-size data set. We also carried out PCAdapt analyses (ref. 32) for each pair of populations using the whole data set. Both analyses look for high derived allele frequency variants in the isolates, and will not be affected by sample size. Finally, we ran the SDS method (ref. 33) using the whole UKO and UKG data sets, which have the largest sample sizes for both isolate and its general population, and thus the greatest power for this method.
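As a rough illustration of the deltaDAF comparison, the Python sketch below takes deltaDAF to be the per-variant difference in derived allele frequency between an isolate and its general population. The cut-off used to flag highly differentiated variants is a hypothetical parameter, not a value from the paper, and the function names are illustrative.

```python
def delta_daf(daf_isolate, daf_general):
    """Per-variant difference in derived allele frequency (isolate minus general).

    Both arguments map variant ID -> derived allele frequency; only variants
    present in both populations are compared.
    """
    shared = daf_isolate.keys() & daf_general.keys()
    return {v: daf_isolate[v] - daf_general[v] for v in shared}


def highly_differentiated(deltas, threshold):
    """Variant IDs with |deltaDAF| >= threshold, largest difference first."""
    hits = [v for v, d in deltas.items() if abs(d) >= threshold]
    return sorted(hits, key=lambda v: -abs(deltas[v]))


if __name__ == "__main__":
    iso = {"v1": 0.9, "v2": 0.2, "v3": 0.5}
    gen = {"v1": 0.1, "v2": 0.25, "v3": 0.5}
    print(highly_differentiated(delta_daf(iso, gen), threshold=0.7))  # ['v1']
```

A large deltaDAF on its own does not separate selection from drift, which is why the highly differentiated variants reported in the Results are still interpreted as compatible with drift.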

Data availability

Amalgamated genotype calls across all populations studied are available through the European Genome/Phenome Archive (EGAD00001002014) with Data Access Agreement described in Supplementary Information.

Additional information

How to cite this article: Xue, Y. et al. Enrichment of low-frequency functional variants revealed by whole-genome sequencing of multiple isolated European populations. Nat. Commun. 8, 15927 doi: 10.1038/ncomms15927 (2017). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

41 in total

1. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

2. Detecting identity by descent and estimating genotype error rates in sequence data.

Authors: Brian L Browning; Sharon R Browning
Journal: Am J Hum Genet Date: 2013-10-24 Impact factor: 11.025

3. Genetic characterization of Greek population isolates reveals strong genetic drift at missense and trait-associated variants.

Authors: Kalliope Panoutsopoulou; Konstantinos Hatzikotoulas; Dionysia Kiara Xifara; Vincenza Colonna; Aliki-Eleni Farmaki; Graham R S Ritchie; Lorraine Southam; Arthur Gilly; Ioanna Tachmazidou; Segun Fatumo; Angela Matchan; Nigel W Rayner; Ioanna Ntalla; Massimo Mezzavilla; Yuan Chen; Chrysoula Kiagiadaki; Eleni Zengini; Vasiliki Mamakou; Antonis Athanasiadis; Margarita Giannakopoulou; Vassiliki-Eirini Kariakli; Rebecca N Nsubuga; Alex Karabarinde; Manjinder Sandhu; Gil McVean; Chris Tyler-Smith; Emmanouil Tsafantakis; Maria Karaleftheri; Yali Xue; George Dedoussis; Eleftheria Zeggini
Journal: Nat Commun Date: 2014-11-06 Impact factor: 14.919

4. A global reference for human genetic variation.

Authors: Adam Auton; Lisa D Brooks; Richard M Durbin; Erik P Garrison; Hyun Min Kang; Jan O Korbel; Jonathan L Marchini; Shane McCarthy; Gil A McVean; Gonçalo R Abecasis
Journal: Nature Date: 2015-10-01 Impact factor: 49.962

5. A rare functional cardioprotective APOC3 variant has risen in frequency in distinct population isolates.

Authors: Ioanna Tachmazidou; George Dedoussis; Lorraine Southam; Aliki-Eleni Farmaki; Graham R S Ritchie; Dionysia K Xifara; Angela Matchan; Konstantinos Hatzikotoulas; Nigel W Rayner; Yuan Chen; Toni I Pollin; Jeffrey R O'Connell; Laura M Yerges-Armstrong; Chrysoula Kiagiadaki; Kalliope Panoutsopoulou; Jeremy Schwartzentruber; Loukas Moutsianas; Emmanouil Tsafantakis; Chris Tyler-Smith; Gil McVean; Yali Xue; Eleftheria Zeggini
Journal: Nat Commun Date: 2013 Impact factor: 14.919

6. Genome scans for detecting footprints of local adaptation using a Bayesian factor model.

Authors: Nicolas Duforet-Frebourg; Eric Bazin; Michael G B Blum
Journal: Mol Biol Evol Date: 2014-06-03 Impact factor: 16.240

7. Demography and the age of rare variants.

Authors: Iain Mathieson; Gil McVean
Journal: PLoS Genet Date: 2014-08-07 Impact factor: 5.917

8. Distribution and medical impact of loss-of-function variants in the Finnish founder population.

Authors: Elaine T Lim; Peter Würtz; Aki S Havulinna; Priit Palta; Taru Tukiainen; Karola Rehnström; Tõnu Esko; Reedik Mägi; Michael Inouye; Tuuli Lappalainen; Yingleong Chan; Rany M Salem; Monkol Lek; Jason Flannick; Xueling Sim; Alisa Manning; Claes Ladenvall; Suzannah Bumpstead; Eija Hämäläinen; Kristiina Aalto; Mikael Maksimow; Marko Salmi; Stefan Blankenberg; Diego Ardissino; Svati Shah; Benjamin Horne; Ruth McPherson; Gerald K Hovingh; Muredach P Reilly; Hugh Watkins; Anuj Goel; Martin Farrall; Domenico Girelli; Alex P Reiner; Nathan O Stitziel; Sekar Kathiresan; Stacey Gabriel; Jeffrey C Barrett; Terho Lehtimäki; Markku Laakso; Leif Groop; Jaakko Kaprio; Markus Perola; Mark I McCarthy; Michael Boehnke; David M Altshuler; Cecilia M Lindgren; Joel N Hirschhorn; Andres Metspalu; Nelson B Freimer; Tanja Zeller; Sirpa Jalkanen; Seppo Koskinen; Olli Raitakari; Richard Durbin; Daniel G MacArthur; Veikko Salomaa; Samuli Ripatti; Mark J Daly; Aarno Palotie
Journal: PLoS Genet Date: 2014-07-31 Impact factor: 5.917

9. The UK10K project identifies rare variants in health and disease.

Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal: Nature Date: 2015-09-14 Impact factor: 49.962

10. Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers.

Authors: Carlo Sidore; Fabio Busonero; Andrea Maschio; Eleonora Porcu; Silvia Naitza; Magdalena Zoledziewska; Antonella Mulas; Giorgio Pistis; Maristella Steri; Fabrice Danjou; Alan Kwong; Vicente Diego Ortega Del Vecchyo; Charleston W K Chiang; Jennifer Bragg-Gresham; Maristella Pitzalis; Ramaiah Nagaraja; Brendan Tarrier; Christine Brennan; Sergio Uzzau; Christian Fuchsberger; Rossano Atzeni; Frederic Reinier; Riccardo Berutti; Jie Huang; Nicholas J Timpson; Daniela Toniolo; Paolo Gasparini; Giovanni Malerba; George Dedoussis; Eleftheria Zeggini; Nicole Soranzo; Chris Jones; Robert Lyons; Andrea Angius; Hyun M Kang; John Novembre; Serena Sanna; David Schlessinger; Francesco Cucca; Gonçalo R Abecasis
Journal: Nat Genet Date: 2015-09-14 Impact factor: 38.330

21 in total

1. Characterization of Exome Variants and Their Metabolic Impact in 6,716 American Indians from the Southwest US.

Authors: Hye In Kim; Bin Ye; Nehal Gosalia; Çiğdem Köroğlu; Robert L Hanson; Wen-Chi Hsueh; William C Knowler; Leslie J Baier; Clifton Bogardus; Alan R Shuldiner; Cristopher V Van Hout
Journal: Am J Hum Genet Date: 2020-07-07 Impact factor: 11.025

2. Understanding the Hidden Complexity of Latin American Population Isolates.

Authors: Jazlyn A Mooney; Christian D Huber; Susan Service; Jae Hoon Sul; Clare D Marsden; Zhongyang Zhang; Chiara Sabatti; Andrés Ruiz-Linares; Gabriel Bedoya; Nelson Freimer; Kirk E Lohmueller
Journal: Am J Hum Genet Date: 2018-10-25 Impact factor: 11.025

3. Mutations in L-type amino acid transporter-2 support SLC7A8 as a novel gene involved in age-related hearing loss.

Authors: Meritxell Espino Guarch; Mariona Font-Llitjós; Isabel Varela-Nieto; Paolo Gasparini; Manuel Palacín; Virginia Nunes; Silvia Murillo-Cuesta; Ekaitz Errasti-Murugarren; Adelaida M Celaya; Giorgia Girotto; Dragana Vuckovic; Massimo Mezzavilla; Clara Vilches; Susanna Bodoy; Ignasi Sahún; Laura González; Esther Prat; Antonio Zorzano; Mara Dierssen
Journal: Elife Date: 2018-01-22 Impact factor: 8.140

4. Cohort-wide deep whole genome sequencing and the allelic architecture of complex traits.

Authors: Arthur Gilly; Daniel Suveges; Karoline Kuchenbaecker; Martin Pollard; Lorraine Southam; Konstantinos Hatzikotoulas; Aliki-Eleni Farmaki; Thea Bjornland; Ryan Waples; Emil V R Appel; Elisabetta Casalone; Giorgio Melloni; Britt Kilian; Nigel W Rayner; Ioanna Ntalla; Kousik Kundu; Klaudia Walter; John Danesh; Adam Butterworth; Inês Barroso; Emmanouil Tsafantakis; George Dedoussis; Ida Moltke; Eleftheria Zeggini
Journal: Nat Commun Date: 2018-11-07 Impact factor: 14.919

5. Inter-individual genomic heterogeneity within European population isolates.

Authors: Paolo Anagnostou; Valentina Dominici; Cinzia Battaggia; Alessandro Lisi; Stefania Sarno; Alessio Boattini; Carla Calò; Paolo Francalacci; Giuseppe Vona; Sergio Tofanelli; Miguel G Vilar; Vincenza Colonna; Luca Pagani; Giovanni Destro Bisol
Journal: PLoS One Date: 2019-10-09 Impact factor: 3.240

6. Genomic Predictors of Asthma Phenotypes and Treatment Response. (Review)

Authors: Natalia Hernandez-Pacheco; Maria Pino-Yanes; Carlos Flores
Journal: Front Pediatr Date: 2019-02-05 Impact factor: 3.418

7. Exome sequencing of Finnish isolates enhances rare-variant association power.

Authors: Adam E Locke; Karyn Meltz Steinberg; Charleston W K Chiang; Susan K Service; Aki S Havulinna; Laurel Stell; Matti Pirinen; Haley J Abel; Colby C Chiang; Robert S Fulton; Anne U Jackson; Chul Joo Kang; Krishna L Kanchi; Daniel C Koboldt; David E Larson; Joanne Nelson; Thomas J Nicholas; Arto Pietilä; Vasily Ramensky; Debashree Ray; Laura J Scott; Heather M Stringham; Jagadish Vangipurapu; Ryan Welch; Pranav Yajnik; Xianyong Yin; Johan G Eriksson; Mika Ala-Korpela; Marjo-Riitta Järvelin; Minna Männikkö; Hannele Laivuori; Susan K Dutcher; Nathan O Stitziel; Richard K Wilson; Ira M Hall; Chiara Sabatti; Aarno Palotie; Veikko Salomaa; Markku Laakso; Samuli Ripatti; Michael Boehnke; Nelson B Freimer
Journal: Nature Date: 2019-07-31 Impact factor: 49.962

8. Genomic Analyses of Human European Diversity at the Southwestern Edge: Isolation, African Influence and Disease Associations in the Canary Islands.

Authors: Beatriz Guillen-Guio; Jose M Lorenzo-Salazar; Rafaela González-Montelongo; Ana Díaz-de Usera; Itahisa Marcelino-Rodríguez; Almudena Corrales; Antonio Cabrera de León; Santos Alonso; Carlos Flores
Journal: Mol Biol Evol Date: 2018-12-01 Impact factor: 16.240

9. Exploring rare and low-frequency variants in the Saguenay-Lac-Saint-Jean population identified genes associated with asthma and allergy traits.

Authors: Andréanne Morin; Anne-Marie Madore; Tony Kwan; Maria Ban; Jukka Partanen; Lars Rönnblom; Ann-Christine Syvänen; Stephen Sawcer; Hendrik Stunnenberg; Mark Lathrop; Tomi Pastinen; Catherine Laprise
Journal: Eur J Hum Genet Date: 2018-09-11 Impact factor: 4.246

10. An actionable KCNH2 Long QT Syndrome variant detected by sequence and haplotype analysis in a population research cohort.

Authors: Shona M Kerr; Lucija Klaric; Mihail Halachev; Caroline Hayward; Thibaud S Boutin; Alison M Meynert; Colin A Semple; Annukka M Tuiskula; Heikki Swan; Javier Santoyo-Lopez; Veronique Vitart; Chris Haley; John Dean; Zosia Miedzybrodzka; Timothy J Aitman; James F Wilson
Journal: Sci Rep Date: 2019-07-29 Impact factor: 4.379