Literature DB >> 27428751

Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery.

Eric M Scott^1,2,3, Anason Halees⁴, Yuval Itan⁵, Emily G Spencer^1,2,3, Yupeng He^1,2,3, Mostafa Abdellateef Azab^1,2,3, Stacey B Gabriel⁶, Aziz Belkadi^7,8, Bertrand Boisson^5,7,8, Laurent Abel^5,7,8, Andrew G Clark⁹, Fowzan S Alkuraya^10,11, Jean-Laurent Casanova^1,5,7,8,12, Joseph G Gleeson^1,2,3.

Abstract

The Greater Middle East (GME) has been a central hub of human migration and population admixture. The tradition of consanguinity, variably practiced in the Persian Gulf region, North Africa, and Central Asia, has resulted in an elevated burden of recessive disease. Here we generated a whole-exome GME variome from 1,111 unrelated subjects. We detected substantial diversity and admixture in continental and subregional populations, corresponding to several ancient founder populations with little evidence of bottlenecks. Measured consanguinity rates were an order of magnitude above those in other sampled populations, and the GME population exhibited an increased burden of runs of homozygosity (ROHs) but showed no evidence for reduced burden of deleterious variation due to classically theorized 'genetic purging'. Applying this database to unsolved recessive conditions in the GME population reduced the number of potential disease-causing variants by four- to sevenfold. These results show variegated genetic architecture in GME populations and support future human genetic discoveries in Mendelian and population genetics.

Entities: Chemical

Mesh：

Substances：
Genetic Markers

Year: 2016 PMID： 27428751 PMCID： PMC5019950 DOI： 10.1038/ng.3592

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

The Greater Middle East (GME), loosely defined as a large swath of Arab and non-Arab Muslim countries from Morocco in the west to as far east as Pakistan [5], is home to approximately 10% of the world’s population. Despite its invaluable contribution to our understanding of the genetic causes of inherited conditions, especially recessive conditions, and its critical hub as a crossroad to early civilizations, genetic architecture and extent of rare genetic variation remains poorly defined [6-8]. To address this shortcoming, the GME Variome Consortium collected whole-exome data on 1,794 self-reported nationals from GME regions participating in on-going genetics studies. In order to minimize selection bias or overrepresentation of disease alleles, we selected primarily healthy individuals from families, and wherever possible, removed from datasets the allele that brought the family to medical attention. Samples were jointly processed, and filtered for quality and familial relation, leaving 1,111 high-quality unrelated individuals. We grouped the 1,111 GME exomes into six different GME subregions: Northwest Africa (NWA, 85 samples), Northeast Africa (NEA, 423 samples), Turkish Peninsula (TP, 140 samples), Syrian Desert (SD, 81 samples), Arabian Peninsula (AP, 214 samples), and Persia and Pakistan (PP, 168 samples) (Fig. S1, Table S1), which represent historic groupings, then compared with exomic data of nine established continental populations from 1000 Genomes (1000G) [9]. Unbiased identity-by-state clustering showed that samples largely grouped according to the location of ascertainment, validating grouping criteria (Fig. S2). To evaluate GME genetic substructure, we ran the unsupervised algorithm ADMIXTURE [10], where K=6 clusters minimized cross-validation error (Fig. S3). We found some overlap with the primary admixture components from Africa, Europe and East Asia at the edges of geography, but also a large proportion not found in previous reference samples (Figs. 1a, S4). The admixture results also aligned with publications reporting common variation [11-13].

Figure 1

Greater Middle East Variome as a hub of human genetics

a. Map of GME sub-regions. Lines define borders for admixture analysis from East Asia, Europe, Sub-Saharan Africa and the novel GME contribution (NWA: Northwest Africa, NEA: Northeast Africa, TP: Turkish Peninsula, SD: Syrian Desert, AP: Arabian Peninsula, PP: Persia and Pakistan). Pie charts: admixture proportions of 1000 Genomes Project (1000G) continental populations according to K=6 clusters.

b. Global ancestry proportions (K=6) for 1000G control populations with three distinct sources of contribution. 1000G population contributions: Africa (red), Europe (green) and East Asia (blue). GME populations from west to east: NWA (purple), AP (orange), and PP (yellow) derived from the GME.

c. TreeMix phylogeny of GME along with 1000G controls representing population divergence patterns. Length of the branch proportional to population drift. GME populations grouped around the African branch, but showed a substantial divergence. YRI: Yoruba in Ibadan, LWK: Luhya in Webuye Kenya, FIN: Finnish, GBR: Great Britain, TSI: Toscani, CHS: Southern Han Chinese, CHB: Han Chinese in Beijing, JPT: Japanese in Tokyo.

d. Wright’s Fixation Index (Fst) values for all pairs of GME and 1000G European populations, showing a smaller distance between GME and European populations compared with Sub-Saharan African populations. Greatest Fst value between any two GME populations was 0.026 (i.e. a quarter of the distance between FIN and JPT).

The least admixed samples were found in NWA, AP, and PP, suggesting these were founder populations, but showed inter-regional variation of GME-specific components suggesting local admixture (Fig. 1b), and potentially supporting historic events. The NWA component was found from west to east across North Africa, likely representing the presence of Berber genetic background [14]. The AP component likely represented ancestral Arab populations and was observed in nearly all regions, possibly a result of the Arab conquests of the 7th century coincident with the expansion of the Arabic language now spoken over much of the region. Similarly, the Persian expansion into TP, SD, and parts of NEA in the 5th century was the most likely contributor of PP signal. Additional sources of human heterogeneity derive from ancient introgression. We found similar patterns of Neanderthal introgression across all GME populations with the exception of NWA, which clustered closer to Sub-Saharan Africans (Fig. S5) [15-17]. These data supports the reduced Neanderthal introgression observed in native African populations. Patterns of human migration and drift were recapitulated using TreeMix among GME subregions, based upon 1000G control populations (Fig. 1c) [18]. The inferred tree with no migration showed tight clusters of European and Asian populations, but much larger apparent divergence among GME regions. The ordering of GME subregions from the root corroborated much of the ‘out-of-Africa’ ordering of subsequent founder populations [13]. Within the GME, the distance from the root emulated the west-to-east organization of GME samples, with PP showing the largest inferred drift parameter, supporting a west-to-east trajectory of human migrations. Assessment of Wright’s fixation index (Fst) demonstrated that the GME grouped with European populations, agreeing with TreeMix results. This resulted in three distinct clusters with a low degree of differentiation (Figs. 1d, S6). PP and NWA represented the extremes of the identified subregions, and showed the highest degree of differentiation (Fst = 0.026) (>2x compared to the distance between Finnish (FIN) and Toscani (TSI) but smaller than intercontinental comparisons). Of the four measured 1000G European populations, GME Fst measurements were closest to TSI, especially SD and TP, consistent with higher levels of European admixture in these populations. Despite the contribution of admixture, these values suggested extended periods of isolation relative to 1000G populations within each subregion. Inter-subregion relationships were tested using principal component analysis (PCA). As expected, the first two PCs separated along well-established geographic axes: PC1 separated Sub-Saharan Africans from all other populations, and PC2 separated Eurasian populations (Fig. S7). GME sub-regions fell between the 1000G African, East Asian, and European, supporting recent admixture. PP and TP were closer to East Asian, while NEA, NWA, and SD were closer to Sub-Saharan Africans. PC3 and PC4 separated samples along topographical north-south and east-west gradients, while exhibiting largely distinct but overlapping groups with a high-degree of inter-region diversity (Fig. 2a).

Figure 2

Wide diversity and high inbreeding coefficients in GME substructure

a. Principal component analysis (PCA) for individuals from GME and 1000G populations. Individuals projected along PC3 and PC4 axes. Persia and Pakistan (PP), Northwest Africa (NWA) and Europe defined the limits from right, left, and top, as coinciding with geography. Arab Peninsula (AP) defined the bottom limit, and was closest to Northeast Africa (NEA) and Syrian Desert (SD).

b. GME populations had increased rates of linkage disequilibrium decay compared to 1000G European and East Asian populations. Mean variant correlations (r2) shown for each 1,000 basepair (bp) bin from 1,000–70,000 bp.

c. Inbreeding coefficient (F) distributions for GME and 1000G populations. GME populations (purple) showed elevated F values, consistent with increased rates of consanguineous marriages. Box plots show median (horizontal line), 25%ile (45° angle), 75%ile (90° angle), minimum and maximum observations (whiskers).

d. F distributions for family structures for GME and European American (EA) trios. Mean F values correlated with expected for consanguineous offspring. Unk=unknown.

To test if these populations were subject to bottlenecks, we calculated the mean linkage disequilibrium (LD) decay, as haplotypes should decay as a function of size more slowly with increased bottleneck (Fig. 2b). LD for each GME population decayed faster than European and East Asian but slower than African populations. LD decayed faster in NWA and NEA compared with other GME regions, in agreement with our TreeMix results. Diverse patterns of admixture across these regions suggested these trends were not predominantly due to intermixing, but instead argued for a historic common ancient bottleneck. Between 20–50% of all GME marriages are consanguineous (compared with < 0.2% in the Americas and Western Europe) [1-3], with the majority being first cousin. This roughly 100X higher rate of consanguinity has correlated with roughly a doubling of the rate of recessive Mendelian disease [19,20]. European, African and East Asian 1000G populations all had distributions of estimated inbreeding coefficients (F) ~0.005, whereas GME F values ranged from 0.059 to 0.098, but with high variance within each population (Fig. 2c). Thus, measured F was ~10–20X higher, reflecting the shared blocks common to all human populations. F values were dominated by immediate family structure rather than historic or population-wide data trends (Fig. S8) [21]. Examining the larger set of 1,794 exomes that included many parent-child trios also showed an overwhelming influence of immediate family structure, in which offspring from first-cousin marriages displayed higher F values compared with non-consanguineous marriages (Fig. 2d). We expected that higher F values would correlate with an increased burden and length of ‘runs of homozygosity’ (ROH), defined as homozygous haplotypes as a function of length [22]. 1000G sub-Saharan Africa displayed the smallest total ROH as expected [23], whereas the two other 1000G assessed populations were relatively similar to GME (Figs. 3a, S9), probably reflecting similar lengths of short (<0.515 Mb) and medium (0.516–1.606 Mb) ROH. Most striking was the increase in long ROH (>1.607 Mb), found nearly exclusively in GME samples, especially for those over 4 Mb (Fig. 3b). In the GME, there was an enrichment of rare and very rare variants (AF <.05, and AF <.01) in longer ROH, and of common variants (AF >= .05) in shorter ROH (Fig. 3c), suggesting that the longer ROH result from recent consanguinity [24]

Figure 3

Distributions of short and long Runs of Homozygosity (ROH) correlates with patterns of bottlenecks and recent consanguinity

a. Sample burdens of ROH grouped by length (Short: <0.155 Mb, Medium: 0.156–1.606 Mb, Long: >1.607 Mb). GME samples (purple) showed a unique contribution of long ROH compared with other populations (*), with less in short and medium bins compared to Europe and East Asia. Total ROH in GME sub-regions overlapped with European and East Asian likely due to greater bottlenecks in these populations.

b. Histograms of long ROH for GME, Africa, Europe, and East Asia. GME samples more frequently harbored runs >4 Mb compared to other populations. ROH >15 Mb are binned together (* peak unique to Middle East).

c. Longer GME ROH spans were enriched for rare variation, while shorter runs were enriched for more common variation. Proportion of variants binned by allele frequency for different sized ROH, binned by 0.5 Mb intervals. Probability density function calculated for each allele frequency class. Note that AFs for common alleles declined whereas AFs for rare and very rare alleles rose as ROH increased in size (Common: AF > .05, Rare: AF 0.05–0.01, Very Rare: AF < 0.01).

This increased ROH provided an opportunity to identify homozygous loss-of-function variants (LOF) in healthy humans. While these variants are only putatively LOF until experimentally verified, these exhibit the strongest signs of selective pressure and are the first checked as disease candidates[25]. Recently, among 2,636 sequenced and 101,584 chip-imputed Icelanders, 1,171 genes were predicted to be inactivated [26]. From our 354 exomes on verified healthy adults, we found 301 genes with rare homozygous putative LOF variants (Table S2, S3), with only 50 genes overlapping with the Icelandic gene list. Similarly the ExAC dataset on 60,706 sequenced individuals identified 2068 genes inactivated, of which only 94 genes overlapped with our 301 genes. This suggests that the set of non-clinically relevant LOF variants is far from being exhausted. The GME represents an optimal population from which to identify homozygous variants due to the elevated consanguinity rates. Darwin observed that rare self-fertilized orchid strains exhibited surprisingly higher fitness than founder strains, which he termed ‘hero strains’ [27]. This led to the concept of ‘purging of recessive alleles’ by Haldane [28], referring to increased loss of deleterious alleles due to increased selective pressure in inbred populations. Purging was hypothesized to impact the GME genome due to the higher rates of birth defects incompatible with future reproduction [29], but has yet to be documented in humans. We compared the distribution of derived allele frequencies (DAF) in GME and 1000G populations [30]. Variants were divided into 7 functional and PolyPhen-2 deleterious classes. We calculated mean DAFs using chimpanzee (PanTro2) as the common ancestor (Figs. S10, S11) [31]. Neither autosomal nor X-linked variants showed significant differences (Fig. S12), arguing against a measurable effect on overall variant burden resulting from consanguinity. Numerous studies have relied on the increased power of GME-resident consanguineous families to identify causes of recessive disease, but the lack of an accessible variome has hindered progress. Efforts like the NHLBI GO Exome Sequencing Project (ESP) produced variomes for European American (EA) and African American (AA) populations, but poor correlation of DAFs between population pairs determined that neither were good estimators for GME DAFs (Pearson’s r 0.7979 GME vs. EA, 0.385 GME vs. AA, 0.1447 EA vs. AA, Figs. 4a, S13). Moreover, we found much of the GME variation to be poorly represented outside the GME (Fig. 4b), with the majority of variants in the rarest DAF bin found only in the GME.

Figure 4

GME Variome facilitates the discovery of Mendelian disease genes

a–b. Comparison of rare derived allele frequencies (DAF) between GME and Exome Sequencing Project (ESP). AA: African American, EA: European-American. Hexagonal bins shaded by log number of variants within each bin. Pearson’s r suggests GME DAFs were not accurately estimated by AA or EA populations.

b. The majority of variants in the rarest DAF bins were unique to the GME. AA: found only in GME and AA. EA: found only in GME and EA. All: found in GME, EA and AA. GME Unique: found only in GME.

c. Change in per-individual burden of eight variant classes as a function of increasing the number of individuals incorporated into the GME Variome cohort. As sample size increased there was a drop in the number of unique variants, along with more accurate estimation of DAFs for rare variants. Bootstraps were sampled with replacement for 100 iterations to calculate standard errors. “High impact”: variants meeting predicted deleteriousness thresholds (see Methods).

d. Number of candidate variants for 20 families, meeting segregation and deleteriousness filtering criteria, using DAFs derived from Hereditary Spastic Paraplegia (HSP)-only families (top) or also incorporating the GME Variome (bottom). Single, Duo, Trio: families with one, two or three affected members. Colors: number of individuals sharing the variant. “0”: no other individuals carried the allele, etc. Analysis was performed using this threshold for the number of individuals sharing alleles (0,1,2,3). Note drop in number of segregating variants for any given family after the GME Variome was applied.

In order to assess how well the GME Variome captured extant exome variation, we sub-sampled the cohort for 100 iterations from 5 to 700 individuals, and for 8 variant classes (see methods, Fig. 4c). There was decay in the number of unique variants and accumulation of rare variants as sample size increased, due to a scaled ability to estimate prevalence. When sampled near 1,000 individuals, the change in mean of these values was negligible as new samples were added. Thus the GME Variome should allow accurate determination of population-level DAFs for all but the rarest alleles. In order to investigate the potential of the Variome to expedite the discovery of new disease genes, we compared causal variant sets from GME families displaying recessive hereditary spastic paraplegia (HSP), where we recently established 17 new genetic forms of disease [32]. For a disease like recessive HSP with a prevalence of 3–10 per 100,000 [33] and where there are more than 40 genetic forms and hundreds of individual genetic mutations known, the expected allele frequency for any causative mutation should be <1:1000 (see methods). Select individuals from 20 representative families underwent whole exome sequencing. For each family, we calculated the number of alleles that passed standard filtering (i.e. LOF or otherwise potential ‘high impact’) [34] and were unique, both without and with the DAFs of the Variome (see website for Variome below). Using only exome data from the 20 families and public sources, there were on average 56, 20, and 11 unique variants passing filters from families with one, two or three sequenced affected members, respectively. In contrast, by accessing the Variome there were on average 13, 5, and 4 unique variants (Fig. 4d, Table S4), yielding a 4–7-fold reduction of the number of variants requiring further consideration. Loosening the allowable AFs to <1:500, <1:333 or <1:250 also showed substantial reduction in the number variants for consideration. Here we have interrogated the fine-scale genomic structure across the GME, shaped by prehistoric as well as historic migrations, conquests, and cultural traditions. The degree of unique genetic variation represented in the GME was surprising given previous efforts to capture diversity, and speaks to the value of sampling of understudied populations. The data support records of migrations and conquests, but also suggest a previously unstudied GME contribution. Despite millennia of elevated consanguinity in the GME, we detected no evidence for purging of recessive alleles. Instead, we detected large rare homozygous blocks, distinct from the small homozygous blocks found in other populations, supporting recent consanguineous matings, and allowing identification of genes harboring putatively high impact homozygous variants in healthy humans from this population. Applying the Variome to future sequencing projects for GME-originating subjects could aid in recessive gene identification across all classes of disease. GME Variome is a publicly accessible resource that will facilitate a broad range of genomic studies in the GME and globally.

Online Methods

1. Definition of the Greater Middle East

The term “greater Middle East” has been used to refer to a large swath of Arab and non-Arab Muslim countries, stretching from Morocco in the west to as far east as Pakistan in southeast Asia. However, no precise listing of designated countries has yet emerged. “U.S. Working Paper for G8 Sherpas,” Al-Hayat, February 13, 2004. Available online at [http://english.daralhayat.com/Spec/02-2004/Article-20040213-ac40bdaf-c0a8-01ed-004e-5e7ac897d678/story.html] and [https://www.fas.org/sgp/crs/mideast/RS22053.pdf]. Editable map of the Middle East was downloaded from [http://www.presentationmagazine.com].

2. Exome Resequencing

2.1 Study sample

The 2,497 individuals used in the analysis were selected from samples ascertained across three labs and recruited with the help of clinicians that constituted the GME Consortium. Although these individuals were not a random sample, they were ascertained within a wide variety of distinct phenotypes such that cohort-specific effects were not expected to bias patterns of variation. All study participants in each of the component studies provided written informed consent for the use of their DNA in studies aimed at identifying genetic risk variants for disease, and for broad data sharing. Institutional certification was obtained for each sample to allow deposition of genotype data in dbGaP and other purposes.

2.2 Exome resequencing, variant calling, and filtering

Blood DNA was extracted using Qiagen reagents, subjected to exome capture with the Agilent SureSelect Human All Exome 50 Megabase (Mb) kit, sequenced on an Illumina HiSeq2000 instrument, resulting in ~94% target coverage at > 30X depth [35-37]. FASTQ files were reprocessed and jointly called to minimize batch effects and ensure consistent variant calling. using the GATK pipeline (version 3.1–1) adhering to best practices [38], eliminating duplicate reads. Paired-end reads were aligned to the human reference genome NCBI Build 37, using BWA (version 0.7.5) [39]. Principal component analysis (PCA) was run on the resultant set of variants to identify potential batch effects between labs, sequencing centers, or collectively run groups of samples, then samples eliminated until no batch effects were observed. We calculated four quality control (QC) metrics for each sample using PSEQ and identified statistical outliers. Metrics included: total number of variants, transition/transversion ratio, number of sequenced positions, and number of singletons. Due to possible reference distance bias, we considered samples grouped by geographic region independently. Samples were identified as outliers using a cutoff of >5 standard deviations from the mean threshold for each QC metric, removing 314 samples. The PCA based outlier analysis algorithm from the EIGENSOFT software library was also run, but failed to find any additional samples violating a standard deviation threshold of 5.0 [40]. To ensure unbiased population structure statistics and allele frequency estimates, we removed close and cryptic relationships from the dataset. Kinship estimation was generated using KING, which calculated relatedness between all pairs of individuals and was robust to population structure [41]. Using the 182,967 LD filtered SNPs, we ran KING following standard guidelines for a 3rd degree relationship (i.e. first cousins), using a kinship coefficient of 0.04419. When a cluster of related individuals was identified, we preferentially removed those to leave the largest number of samples. Of the remaining 2183 samples after outlier filtering, 667 samples were removed to reduce dataset relatedness, leaving a final cohort of 1516 non-related individuals. Remaining samples were rerun through the KING, which identified no additional kinships. Final continental sample counts after filtering: Sub-Saharan Africa: 19, America: 33, Europe: 378, Oceania: 1, and Middle East: 1111. Coverage statistics were generated across all internal exome data sets using BEDTools, to calculate the average coverage across each exon [42]. Exons were filtered from the analysis if greater than 5% of samples had less than 10x average coverage. Out of the initial 192,056 exons targeted by the Agilent SureSelect II capture kit, 170,032 exons were well covered in at least 95% of samples. Variants were filtered if identified outside of these genomic regions, leaving 32,967,859 bases under consideration (~1% of the human genome) within 17,800 genes. Standard filters for variants that were called with posterior probability >99% (glfMultiples SNP quality > 20), were at least 5 bp away from an indel detected in the 1000G Pilot Project, were targeted in at least 95% individuals and had a total depth across samples between 6,823 to 6,823,000 (~1–1000 reads per sample on average) [9]. Variant positions were filtered based on population statistics including a ‘missingness’ rate (referring to the percent of samples where information was missing) of less than 5%, and Hardy-Weinberg equilibrium (HWE) deviation p-value < 0.00005 [43]. We generated a subset of variants in minimal linkage disequilibrium (LD) by pruning variants exhibiting pairwise linkage disequilibrium (r). Variants were filtered to exclude SNPs with minor allele frequency (MAF) <5%, and all indels. Remaining SNPs were pruned adhering to a maximum threshold of 0.5 using PLINK’s ‘--indep-pairwise’ command [43]. Of the initial 578,231 variants, 182,967 SNPs passed filters. This LD pruned dataset was used for population structure characterization including principal component analysis (PCA), Wright’s fixation index (Fst), admixture analysis, KING relationship testing, and estimation of inbreeding coefficient.

2.3 Geographic region assignment

Samples were recruited from 20 countries and territories across the GME and grouped into a set of six geographic regions: Northwest Africa (85 Samples), Northeast Africa (423 Samples), Arabian Peninsula (214 Samples), Syrian Desert (81 Samples), Turkish Peninsula (140 Samples), and Persia and Pakistan (168 Samples). Country boundaries were not used to group samples for two reasons: 1] Inconsistent sampling left several countries with too few samples to accurately represent the diversity of the population. Syria and Yemen, for example, were only represented by a few samples, due to ongoing conflicts. 2] Current country borders frequently fail to accurately separate ethnicities, due to a combination of recent migrations and recent political history. For example, south-eastern Arabian Peninsula Bedouin tribes do not distinguish between the relatively recently defined borders of Oman and the UAE. Self-identified ethnicities were available for some samples, but incompleteness of this annotation, and the great diversity of populations affiliating as “Arab”, prompted use of geography for groupings. As much as possible we assigned location to the current residence, rather than ancestral residence or location where samples were drawn. While some reference GME ethnicities exist in public resources, such as the Human Genome Diversity Project (HGDP) [44], we found both the breadth of ascertained ethnicities and sample size insufficient to impute ethnicities where absent. The original cohort was largely composed of samples from GME countries, but also included samples of African, European, and East Asian decent. To ensure consistency in our geographic designations we performed and linkage clustering, based on pairwise distances between samples using Plink’s ‘--distance-matrix’ command [43]. We performed hierarchical clustering on all samples using Ward’s hierarchical clustering method (“ward.D2” option for the “hclust” algorithm in R)[45].

3. Population Structure of GME

3.1 Data integration

Population structure was analyzed in the context of continental populations from the 1000 Genomes Phase I (1000G) dataset [9]. As 1000G samples were generated from a combination of whole genome and exome sequencing, variants falling outside of RefSeq exonic regions +/− 30 base pairs (bp) were filtered using BedTools and merged with the GME cohort [46,47]. Nine populations from 1000G data were used in comparative analyses: African populations YRI and LWK; East Asian populations CHB, CHS, and JPT; and European populations GBR, TSI, IBS, and FIN. Related 1000G samples were filtered by a KING analysis as previously described. A total of 1821 samples remained after filtering representing 15 geographic regions, 6 from the GME and 9 from 1000G.

3.2 Substructure analysis

To investigate the influence of admixture on the GME samples, we used the block relaxation algorithm implemented in ADMIXTURE to estimate individual ancestry proportions given K ancestral populations [48]. Unsupervised ADMIXTURE was run using default settings (folds=5) on merged GME and 1000G samples and iterations of K values from 2 to 14. Minimum squared error values calculated from ADMIXTURE’s cross-validation procedure for evaluating fit of different values of K, found an optimum K = 6 for just GME samples, and 7 including 1000G control data.

3.3 PCA and Wright’s Fixation Index (Fst)

Principal component analysis (PCA) was used to investigate the affinities within human populations and the relationships between them. We performed PCA on GME and 1000G samples using the SmartPCA tool from the EIGENSOFT software library and the first four principal components compared graphically [40,49]. Wright’s fixation index (Fst) was used to explore the degree of differentiation between populations. Fst values and standard error for all pairs of populations were calculated using the estimator of Weir & Cockerham, also included in the EIGENSOFT software library. All plots were generated using ggplot2 [50].

3.4 LD decay

Pairwise linkage disequilibrium among pairs of SNPs is an indicator of the past history of recombination and genetic drift. To calculate LD, we tallied pairwise r for SNP pairs for all GME and control populations using the Plink “r2” option [43]. Correlations between all SNPs falling within each sliding-window of 70 kilobase (kb) were calculated with no lower limit on r values. Pairwise correlations were binned by genomic distance between SNPs (up to 70kb), and averages calculated for each bin. Control samples followed expected patterns of LD decay.

3.5 Estimation of inbreeding

The inbreeding coefficient of an individual (F) was used to represent the probability that two randomly chosen alleles at a homologous locus within an individual were identical by descent (IBD) with respect to a base reference population in which all alleles were independent. While the true inbreeding coefficient of an individual is often unknown, several estimation methods have been shown to give a reasonable estimate. F estimates were calculated using the Plink “het” algorithm on LD pruned variants following authors guidelines [43]. We compared results to the HMM algorithm Festim [51] and found the two estimates were very similar (Pearson’s r: 0.874) but frequently Festim failed to return results for samples with missing data. Negative F values were most likely the result of either biased variant sampling, a high-degree interracial marriage, or due to recent intermixing of previously disparate populations[8].

3.6 Runs of homozygosity (ROH) estimation

To infer estimates of the autozygosity and relative recent population size, we estimated runs of homozygosity using the HMM algorithm H3M2 [52]. H3M2 was run directly on aligned BAM files, following authors recommendations for all parameters. Proportion of genome and exome falling within ROH was calculated for each sampling using BedTools. ROH length classes were based on published ranges [23], where the authors used machine learning to identify three ROH classes including: Short (<0.515 Mb), Medium (0.156–1.606 Mb), and Long (>1.607 Mb). We compared densities of ROH lengths from internal data and found a near identical distribution as the published values used to identify these classes.

4. Variant Annotation and Classification

4.1 Variant annotation

Functional annotation was performed for genetic purging and loss of function analyses. Variants were annotated using the ANNOVAR suite of scripts (version 2014Nov12) [53]. ANNOVAR classified variants into eight coding region functional groups including: “frameshift_deletion”, “frameshift_insertion”, “nonframeshift_deletion”, “nonframeshift_insertion”, “nonsynonymous_SNV”, “stopgain”, “stoploss”, and “synonymous_SNV”. Non-coding variants are classified as “unknown”. Splicing defects were identified based on 2 base pair distance from the splice junction, either on the intronic or exonic side. A predicted deleteriousness classification was generated for each missense variant using PolyPhen-2 [54]. The functional designations for PolyPhen-2 include: B (Benign), P (Possibly Damaging), D (Probably Damaging). We compared these annotations to those generated by SNPEff [55], and while there were some differences, found distributions of calls from each sample to be consistent.

4.2 Ancestral allele identification

We used the Chimpanzee genome as the closest assembled out-group genome. Ancestral allele estimates were obtained by UCSC pairwise alignments between human reference hg19 and chimp references PanTro2 and PanTro4. Systematic lookups for all GME and 1000G variants were performed using UCSC Genome Browser tools and custom scripts to identify associated chimpanzee alleles. We compared PanTro2 and PanTro4 to assess the difference in correcting the apparent reference bias, but found both worked equally well. Estimated ancestral alleles were used as the reference allele to calculate derived allele frequencies (DAF). DAFs were not calculated for variants where the ancestral allele was not present in the human germline.

4.3 Identity-by-state (IBS) distance to reference

To interrogate the potential biases that might result from reference selection we calculate the IBS distance between samples and multiple different references including hg19, and chimpanzee. The distance represents the proportion of positions that diverge from reference, and was calculated between all pairs of samples and references. The IBS distance, d, represented the number of differing alleles between the two samples divided by the total number of alleles compared. More formally, d, between the two n-length vectors p, q (in our case where p is the reference sample and q is the sample being compared) in a vector space v, where v ={0,1,2} encoding the homozygote for the human reference allele, the heterozygote, and the homozygote for the alternate allele, respectively. For any two samples, we calculate d as: where (p,q) are vectors such that p = (p1, p…, p)and q = (q1,q…,q) Each vector represented all genotype calls between the two samples, excluding filtered sites or missing positions. The IBS distance was calculated for all GME and 1000G samples against the hg19 and chimpanzee reference genomes. All genotypes from the merged VCF file were coded based on a comparison to the hg19 reference. Variant positions were filtered to remove indels, due to the possibility of alignment errors, and non-biallelic sites. When comparing to hg19, vector p was represented by a vector of zeros.

4.4 Hereditary Spastic Paraplegia (HSP) candidate variant analysis

Samples from 20 consanguineous families displaying an autosomal recessive inheritance pattern of HSP were selected from a previously analyzed cohort [32], selected from a total cohort of 55 families because in these 20 there was a single genetic causes identified. Families were analyzed in adherence to published methods [32]. Briefly, homozygous variants were filtered based on family structure to ensure variants segregated with the disease phenotype. We performed deleteriousness filtering using functional classes and GERP++ scores [56]. All candidate variants were potentially LOF (frameshift, stop, or perturbing splicing) or a coding variant with a GERP score >4. The maximum allele frequency for candidate variants were based on established rates of disease prevalence, estimated at 1:10,000 for clinical presentations classified of HSP [57]. Approximately 50% of HSP is autosomal dominant, and of the remaining, about 50% is explained by mutations by SPG11 [58], leaving only 1:40,000 with recessive HSP caused by other genes. At least 35 other genes are reported to cause recessive HSP. Thus, the contribution to HSP disease prevalence for any given gene is unlikely to be more than 1:1,000,000. While prevalence of HSP mutations is not expected to be uniform, we expect the maximum carrier frequency for any new causal variant to be no more than 1:1000, assuming full penetrance and a classic recessive inheritance, and is actuality is likely to be much rarer given allelic diversity. With roughly 1000 individuals in our cohort, we calculated that variants with DAFs <1:1000 should not be observed commonly in our dataset, AFs <1:500 should be not be observed in more than 1 individual, AFs <1:333 in not more than 2 individuals and AFs <1:250 in not more than 3 individuals. Variants passing deleteriousness and allele frequency thresholds were treated as candidates to calculate the usefulness of the Variome to limit the number of deleterious variants considered as candidates.

5. Testing for the Influence of Genetic Purging

Consanguinity has been practiced in the GME for at least several centuries [59]. Simulations of GME like populations have found sufficient time has past for purging to have been effective in reducing genetic load [29]. Clinical studies aimed at comparing clinical rates of birth defect rates, premature births or miscarriages, between communities that practice consanguinity to those largely out-breeding populations have found all metrics have fallen within range for the rate of immediate form of consanguinity [21,60,61]. More recent genetics studies investigating differential selective pressure across human populations focused on the role of population bottlenecks, neglecting the potential influence of consanguinity, and lacked representation from the GME [31,62]. For these reasons, we sought to investigate the possibility that genetic purging has influenced variant burden in the GME. In order to approach the question of variable selective pressure across human populations, we implemented a variation of the DAF comparison method [31]. We assumed that any change in the efficacy of natural selection should be evident across populations in the mean DAF within each variant classes. For all variants described across the GME and 1000G populations, we filtered for high quality calls, identified ancestral alleles (described in “Ancestral Allele” section), annotated variants for predicted function and PolyPhen-2 classes using ANNOVAR, down-sampled to achieve an equivalent numbers of chromosomes across populations, and calculated DAFs for all positions. Variants were grouped by class, and the DAF means were calculated for each population. Standard-errors were calculated by bootstrapping DAF means for 1000 iterations. Recent studies using PolyPhen-2 demonstrated a deflation of deleteriousness scores for derived variants found in the hg19 reference, likely due to a training artifact [31,62]. Before using PolyPhen-2 classes, this bias was corrected for all derived reference positions. Bias correction was implemented by grouping variants by DAF bins, and calculating the proportions of each PolyPhen-2 class per bin for ancestral reference positions. Using these proportions as expectations, and all derived reference positions were randomly reassigned a new PolyPhen-2 class based on a hypergeometric distribution within each DAF bin. DAF means across classes for all included 1000G and GME populations showed no deviation outside the standard-error for any two populations[61].

6. Neanderthal and Denisovan Introgression Analysis

Neanderthal-derived variants are often subjected to strong negative selection, thereby making exome analysis inadequate for estimating age of introgression. Thus we calculated the proportion observed between extant populations [15,63]. To estimate introgression in exome samples, we identified aligned consensus calls for all human variant positions from the chimpanzee, Neanderthal and Denisovan reference genomes. Alignments of Neanderthal and Denisovan genomes to 1000G variant positions were downloaded from the Max Planck Institute for Evolutionary Anthropology FTP [15,64]. Neanderthal and Denisovan alleles were identified from the hg19-ancestor alignment files. Chimpanzee alleles were identified as described in the “ancestral allele” section of these methods. We projected GME and 1000G control populations on the principal components calculated using representative samples from Neanderthal, Denisova, and chimpanzee [16,65,66], and aligned the human samples to these ancestral populations. Principal components were computed using R’s “prcomp” function (see web resources), and projected vectors were calculated for all 1000G and GME samples. Distance from the re-adjusted origin to each species reflected the proportion of introgression observed in each sample. The limited number of SNPs that were examined in this analysis compared to similar genotype-based analysis likely inflated the sampling variance within populations, and limited the sensitivity of our analysis to smaller introgression proportions. Centroids for all populations were labeled with their abbreviated names. Similar to previous work, Europeans, East Asians, and GME populations overlapped, and demonstrated larger proportions of Neanderthal than African populations [65-67].

62 in total

1. A human genome diversity cell line panel.

Authors: Howard M Cann; Claudia de Toma; Lucien Cazes; Marie-Fernande Legrand; Valerie Morel; Laurence Piouffre; Julia Bodmer; Walter F Bodmer; Batsheva Bonne-Tamir; Anne Cambon-Thomsen; Zhu Chen; J Chu; Carlo Carcassi; Licinio Contu; Ruofu Du; Laurent Excoffier; G B Ferrara; Jonathan S Friedlaender; Helena Groot; David Gurwitz; Trefor Jenkins; Rene J Herrera; Xiaoyi Huang; Judith Kidd; Kenneth K Kidd; Andre Langaney; Alice A Lin; S Qasim Mehdi; Peter Parham; Alberto Piazza; Maria Pia Pistillo; Yaping Qian; Qunfang Shu; Jiujin Xu; S Zhu; James L Weber; Henry T Greely; Marcus W Feldman; Gilles Thomas; Jean Dausset; L Luca Cavalli-Sforza
Journal: Science Date: 2002-04-12 Impact factor: 47.728

Review 2. Regional research priorities in brain and nervous system disorders.

Authors: Vijayalakshmi Ravindranath; Hoang-Minh Dang; Rodolfo G Goya; Hader Mansour; Vishwajit L Nimgaonkar; Vivienne Ann Russell; Yu Xin
Journal: Nature Date: 2015-11-19 Impact factor: 49.962

3. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

4. H3M2: detection of runs of homozygosity from whole-exome sequencing data.

Authors: Alberto Magi; Lorenzo Tattini; Flavia Palombo; Matteo Benelli; Alessandro Gialluisi; Betti Giusti; Rosanna Abbate; Marco Seri; Gian Franco Gensini; Giovanni Romeo; Tommaso Pippucci
Journal: Bioinformatics Date: 2014-06-24 Impact factor: 6.937

5. Sequence variants in SLC16A11 are a common risk factor for type 2 diabetes in Mexico.

Authors: Amy L Williams; Suzanne B R Jacobs; Hortensia Moreno-Macías; Alicia Huerta-Chagoya; Claire Churchhouse; Carla Márquez-Luna; Humberto García-Ortíz; María José Gómez-Vázquez; Noël P Burtt; Carlos A Aguilar-Salinas; Clicerio González-Villalpando; Jose C Florez; Lorena Orozco; Christopher A Haiman; Teresa Tusié-Luna; David Altshuler
Journal: Nature Date: 2013-12-25 Impact factor: 49.962

6. Gene flow from North Africa contributes to differential human genetic diversity in southern Europe.

Authors: Laura R Botigué; Brenna M Henn; Simon Gravel; Brian K Maples; Christopher R Gignoux; Erik Corona; Gil Atzmon; Edward Burns; Harry Ostrer; Carlos Flores; Jaume Bertranpetit; David Comas; Carlos D Bustamante
Journal: Proc Natl Acad Sci U S A Date: 2013-06-03 Impact factor: 11.205

7. A systematic survey of loss-of-function variants in human protein-coding genes.

Authors: Daniel G MacArthur; Suganthi Balasubramanian; Adam Frankish; Ni Huang; James Morris; Klaudia Walter; Luke Jostins; Lukas Habegger; Joseph K Pickrell; Stephen B Montgomery; Cornelis A Albers; Zhengdong D Zhang; Donald F Conrad; Gerton Lunter; Hancheng Zheng; Qasim Ayub; Mark A DePristo; Eric Banks; Min Hu; Robert E Handsaker; Jeffrey A Rosenfeld; Menachem Fromer; Mike Jin; Xinmeng Jasmine Mu; Ekta Khurana; Kai Ye; Mike Kay; Gary Ian Saunders; Marie-Marthe Suner; Toby Hunt; If H A Barnes; Clara Amid; Denise R Carvalho-Silva; Alexandra H Bignell; Catherine Snow; Bryndis Yngvadottir; Suzannah Bumpstead; David N Cooper; Yali Xue; Irene Gallego Romero; Jun Wang; Yingrui Li; Richard A Gibbs; Steven A McCarroll; Emmanouil T Dermitzakis; Jonathan K Pritchard; Jeffrey C Barrett; Jennifer Harrow; Matthew E Hurles; Mark B Gerstein; Chris Tyler-Smith
Journal: Science Date: 2012-02-17 Impact factor: 47.728

8. Population structure and eigenanalysis.

Authors: Nick Patterson; Alkes L Price; David Reich
Journal: PLoS Genet Date: 2006-12 Impact factor: 5.917

9. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

10. Reconstructing the population genetic history of the Caribbean.

Authors: Andrés Moreno-Estrada; Simon Gravel; Fouad Zakharia; Jacob L McCauley; Jake K Byrnes; Christopher R Gignoux; Patricia A Ortiz-Tello; Ricardo J Martínez; Dale J Hedges; Richard W Morris; Celeste Eng; Karla Sandoval; Suehelay Acevedo-Acevedo; Paul J Norman; Zulay Layrisse; Peter Parham; Juan Carlos Martínez-Cruzado; Esteban González Burchard; Michael L Cuccaro; Eden R Martin; Carlos D Bustamante
Journal: PLoS Genet Date: 2013-11-14 Impact factor: 5.917

138 in total

1. FAM92A Underlies Nonsyndromic Postaxial Polydactyly in Humans and an Abnormal Limb and Digit Skeletal Phenotype in Mice.

Authors: Isabelle Schrauwen; Arnaud Pj Giese; Abdul Aziz; David Tino Lafont; Imen Chakchouk; Regie Lyn P Santos-Cortez; Kwanghyuk Lee; Anushree Acharya; Falak Sher Khan; Asmat Ullah; Deborah A Nickerson; Michael J Bamshad; Ghazanfar Ali; Saima Riazuddin; Muhammad Ansar; Wasim Ahmad; Zubair M Ahmed; Suzanne M Leal
Journal: J Bone Miner Res Date: 2018-11-05 Impact factor: 6.741

2. Lessons Learned from Large-Scale, First-Tier Clinical Exome Sequencing in a Highly Consanguineous Population.

Authors: Dorota Monies; Mohammed Abouelhoda; Mirna Assoum; Nabil Moghrabi; Rafiullah Rafiullah; Naif Almontashiri; Mohammed Alowain; Hamad Alzaidan; Moeen Alsayed; Shazia Subhani; Edward Cupler; Maha Faden; Amal Alhashem; Alya Qari; Aziza Chedrawi; Hisham Aldhalaan; Wesam Kurdi; Sameena Khan; Zuhair Rahbeeni; Maha Alotaibi; Ewa Goljan; Hadeel Elbardisy; Mohamed ElKalioby; Zeeshan Shah; Hibah Alruwaili; Amal Jaafar; Ranad Albar; Asma Akilan; Hamsa Tayeb; Asma Tahir; Mohammed Fawzy; Mohammed Nasr; Shaza Makki; Abdullah Alfaifi; Hanna Akleh; Suad Yamani; Dalal Bubshait; Mohammed Mahnashi; Talal Basha; Afaf Alsagheir; Musad Abu Khaled; Khalid Alsaleem; Maisoon Almugbel; Manal Badawi; Fahad Bashiri; Saeed Bohlega; Raashida Sulaiman; Ehab Tous; Syed Ahmed; Talal Algoufi; Hamoud Al-Mousa; Emadia Alaki; Susan Alhumaidi; Hadeel Alghamdi; Malak Alghamdi; Ahmed Sahly; Shapar Nahrir; Ali Al-Ahmari; Hisham Alkuraya; Ali Almehaidib; Mohammed Abanemai; Fahad Alsohaibaini; Bandar Alsaud; Rand Arnaout; Ghada M H Abdel-Salam; Hasan Aldhekri; Suzan AlKhater; Khalid Alqadi; Essam Alsabban; Turki Alshareef; Khalid Awartani; Hanaa Banjar; Nada Alsahan; Ibraheem Abosoudah; Abdullah Alashwal; Wajeeh Aldekhail; Sami Alhajjar; Sulaiman Al-Mayouf; Abdulaziz Alsemari; Walaa Alshuaibi; Saeed Altala; Abdulhadi Altalhi; Salah Baz; Muddathir Hamad; Tariq Abalkhail; Badi Alenazi; Alya Alkaff; Fahad Almohareb; Fuad Al Mutairi; Mona Alsaleh; Abdullah Alsonbul; Somaya Alzelaye; Shakir Bahzad; Abdulaziz Bin Manee; Ola Jarrad; Neama Meriki; Bassem Albeirouti; Amal Alqasmi; Mohammed AlBalwi; Nawal Makhseed; Saeed Hassan; Isam Salih; Mustafa A Salih; Marwan Shaheen; Saadeh Sermin; Shamsad Shahrukh; Shahrukh Hashmi; Ayman Shawli; Ameen Tajuddin; Abdullah Tamim; Ahmed Alnahari; Ibrahim Ghemlas; Maged Hussein; Sami Wali; Hatem Murad; Brian F Meyer; Fowzan S Alkuraya
Journal: Am J Hum Genet Date: 2019-05-23 Impact factor: 11.025

3. Understanding the Hidden Complexity of Latin American Population Isolates.

Authors: Jazlyn A Mooney; Christian D Huber; Susan Service; Jae Hoon Sul; Clare D Marsden; Zhongyang Zhang; Chiara Sabatti; Andrés Ruiz-Linares; Gabriel Bedoya; Nelson Freimer; Kirk E Lohmueller
Journal: Am J Hum Genet Date: 2018-10-25 Impact factor: 11.025

4. Myopathic changes associated with psychomotor delay and seizures caused by a novel homozygous mutation in TBCK.

Authors: Simona Saredi; Edmund S Cauley; Alessandra Ruggieri; Tyler M Spivey; Anna Ardissone; Marina Mora; Isabella Moroni; M Chiara Manzini
Journal: Muscle Nerve Date: 2020-05-27 Impact factor: 3.217

5. Clinical and genetic study of Tunisian families with genetic generalized epilepsy: contribution of CACNA1H and MAST4 genes.

Authors: Zied Landoulsi; Fatma Laatar; Eric Noé; Saloua Mrabet; Mouna Ben Djebara; Guillaume Achaz; Caroline Nava; Stéphanie Baulac; Imen Kacem; Amina Gargouri-Berrechid; Riadh Gouider; Eric Leguern
Journal: Neurogenetics Date: 2018-06-12 Impact factor: 2.660

6. Genomic landscape of the Greater Middle East.

Authors: Tayfun Özçelik; Onur Emre Onat
Journal: Nat Genet Date: 2016-08-30 Impact factor: 38.330

7. COQ4 Mutation Leads to Childhood-Onset Ataxia Improved by CoQ10 Administration.

Authors: Ahmet Okay Caglayan; Hakan Gumus; Erin Sandford; Thomas L Kubisiak; Qianyi Ma; A Bilge Ozel; Huseyin Per; Jun Z Li; Vikram G Shakkottai; Margit Burmeister
Journal: Cerebellum Date: 2019-06 Impact factor: 3.847

8. A biallelic ANTXR1 variant expands the anthrax toxin receptor associated phenotype to tooth agenesis.

Authors: Nuriye Dinckan; Renqian Du; Zeynep C Akdemir; Yavuz Bayram; Shalini N Jhangiani; Harsha Doddapaneni; Jianhong Hu; Donna M Muzny; Yeliz Guven; Oya Aktoren; Hulya Kayserili; Eric Boerwinkle; Richard A Gibbs; Jennifer E Posey; James R Lupski; Zehra O Uyguner; Ariadne Letra
Journal: Am J Med Genet A Date: 2018-02-13 Impact factor: 2.802

9. Uner Tan syndrome caused by a homozygous TUBB2B mutation affecting microtubule stability.

Authors: Martin W Breuss; Thai Nguyen; Anjana Srivatsan; Ines Leca; Guoling Tian; Tanja Fritz; Andi H Hansen; Damir Musaev; Jennifer McEvoy-Venneri; Kiely N James; Rasim O Rosti; Eric Scott; Uner Tan; Richard D Kolodner; Nicholas J Cowan; David A Keays; Joseph G Gleeson
Journal: Hum Mol Genet Date: 2017-01-15 Impact factor: 6.150

10. Mutations in LNPK, Encoding the Endoplasmic Reticulum Junction Stabilizer Lunapark, Cause a Recessive Neurodevelopmental Syndrome.

Authors: Martin W Breuss; An Nguyen; Qiong Song; Thai Nguyen; Valentina Stanley; Kiely N James; Damir Musaev; Guoliang Chai; Sara A Wirth; Paula Anzenberg; Renee D George; Anide Johansen; Shaila Ali; Muhammad Zia-Ur-Rehman; Tipu Sultan; Maha S Zaki; Joseph G Gleeson
Journal: Am J Hum Genet Date: 2018-07-19 Impact factor: 11.025