
The GenoChip: a new tool for genetic anthropology.

Eran Elhaik, Elliott Greenspan, Sean Staats, Thomas Krahn, Chris Tyler-Smith, Yali Xue, Sergio Tofanelli, Paolo Francalacci, Francesco Cucca, Luca Pagani, Li Jin, Hui Li, Theodore G Schurr, Bennett Greenspan, R Spencer Wells.

Abstract

The Genographic Project is an international effort aimed at charting human migratory history. The project is nonprofit and nonmedical, and, through its Legacy Fund, supports locally led efforts to preserve indigenous and traditional cultures. Although the first phase of the project was focused on uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA), the current phase focuses on markers from across the entire genome to obtain a more complete understanding of human genetic variation. Although many commercial arrays exist for genome-wide single-nucleotide polymorphism (SNP) genotyping, they were designed for medical genetic studies and contain medically related markers that are inappropriate for global population genetic studies. GenoChip, the Genographic Project's new genotyping array, was designed to resolve these issues and enable higher resolution research into outstanding questions in genetic anthropology. The GenoChip includes ancestry informative markers obtained for over 450 human populations, an ancient human (Saqqaq), and two archaic hominins (Neanderthal and Denisovan) and was designed to identify all known Y-chromosome and mtDNA haplogroups. The chip was carefully vetted to avoid inclusion of medically relevant markers. To demonstrate its capabilities, we compared the FST distributions of GenoChip SNPs to those of two commercial arrays. Although all arrays yielded similarly shaped (inverse J) FST distributions, the GenoChip autosomal and X-chromosomal distributions had the highest mean FST, attesting to its ability to discern subpopulations. The chip performances are illustrated in a principal component analysis for 14 worldwide populations. In summary, the GenoChip is a dedicated genotyping platform for genetic anthropology. 
With an unprecedented number of approximately 12,000 Y-chromosomal and approximately 3,300 mtDNA SNPs and over 130,000 autosomal and X-chromosomal SNPs without any known health, medical, or phenotypic relevance, the GenoChip is a useful tool for genetic anthropology and population genetics.

Keywords: AimsFinder; GenoChip; Genographic Project; genetic anthropology; haplogroups; population genetics

Substances：
DNA, Mitochondrial

Year: 2013 PMID： 23666864 PMCID： PMC3673633 DOI： 10.1093/gbe/evt066

Source DB: PubMed Journal: Genome Biol Evol ISSN： 1759-6653 Impact factor: 3.416

Introduction

Apportionment of human genetic variation has long established that all living humans are related via recent common ancestors who lived in sub-Saharan Africa some 200,000 years ago (Cann et al. 1987). The world outside Africa was settled over the past 50,000–100,000 years (Henn et al. 2010) when the descendants of our African forebears spread out to populate other continents (Cavalli-Sforza 2007). This “Out-of-Africa” hypothesis, backed by archeological findings (Klein 2008) and genetic evidence (Stringer and Andrews 1988; Laval et al. 2010), describes a major dispersal of anatomically modern humans that completely replaced local archaic populations outside Africa, although a scenario involving Europeans and West Africans admixing with extinct hominins was also proposed (Plagnol and Wall 2006). Remarkably, recent studies proposed evidence for two such archaic admixture (interbreeding) events, one with Neanderthals in Europe and eastern Asia (Green et al. 2010) and the second with Denisovans in Southeast Asia and Oceania (Reich et al. 2011), though the extent of the hybridization remains questionable (Eriksson and Manica 2012). Overall, the recurrent migrations, admixture, and interbreeding events shaped the autosomes of modern populations into mosaics of ancient and recent alleles harbored in haplotypes that vary in size but not in the building blocks themselves. These subtle differences in autosomal allele frequency between populations together with uniparental markers provide genetic data with the potential to obtain evidence of mixing and migration of human populations.
The advent of microarray single-nucleotide polymorphism (SNP) technology that revolutionized human population genetics and broadened our understanding of genetic diversity largely skipped genetic anthropology for three main reasons: first, only a handful of the estimated 5,000–6,000 indigenous population groups (Burger and Strong 1990; Fardon 2012) were genotyped and studied, which may limit the phylogeographic resolution of the findings. Second, the plethora of genetic markers obtained from different genotyping platforms has resurrected the “empty matrix” problem, whereby populations from different studies can barely be compared due to the low overlap of these platforms. Finally, genotyping costs remained prohibitively high and unjustified for genetic anthropology, as the commercial genotyping platforms, by and large, do not accommodate ancestry informative markers (AIMs). Furthermore, these arrays are enriched in trait- or disease-related markers, which prompt a host of psychological, social, legal, political, and ethical concerns from the individual to the population and global levels (Royal et al. 2010). The first phase of The Genographic Project focused on reconstructing human migratory paths through the analysis of uniparentally inherited markers on the Y-chromosome and mitochondrial DNA (mtDNA). The success of the project in both inferring details of human migratory history (e.g., Balanovsky et al. 2011; Schurr et al. 2012) and attracting over half a million public participants interested in tracing their genetic ancestry has prompted entrepreneurs to offer multiple self-test kits that provide information ranging from disease risk and life-style choices (e.g., diet) to genetic ancestry (Wolinsky 2006). Some of these solutions have been criticized for making deceptive health-related claims and providing limited and imprecise answers regarding ancestry (Royal et al. 2010).
The concerns about ancestry reporting were not unjustified, as these entrepreneurs adopted the commercial genotyping platforms that were fraught with medically informative markers, depleted of AIMs, and overall yielded biased measures of genetic diversity (Albrechtsen et al. 2010). Although uniparental arrays do not suffer from the aforementioned predicaments, they are limited in that they represent only a smaller and more ancient portion of our history and ignore our remaining ancestors whose contribution to our genome was more recent and substantial. In contrast, assessment of the spatial and temporal patterns of genetic variation in the rest of the genome coupled with data obtained from other disciplines can provide more information about our ancestors. However, autosomal-driven studies attempting to discern markers informative to genetic anthropology from those having medical relevance often met with legal or ethical obstacles and failed to attract participants who remained concerned about the sharing and potential exploitation of their medical information (Royal et al. 2010). These constraints render all commercial genotyping arrays unsuitable for genetic anthropology, including the Human Origins array (Lu et al. 2011) that contains coding and medically related markers. To facilitate high-quality research in genetic anthropology without obtaining health, trait, or medical information, we resolved to develop a novel genotyping array, which we call the GenoChip. Our goals were to 1) design a state-of-the-art SNP array dedicated solely to genetic anthropology, 2) validate its accuracy, 3) evaluate its ability to discern populations compared with alternative arrays, and 4) demonstrate its performance on worldwide populations.

Materials and Methods

Genotype Data Retrieval

AIMs were obtained from 15 studies (Yang et al. 2005; Price et al. 2007, 2008; Halder et al. 2008; Tian et al. 2008, 2009; Florez et al. 2009; Kosoy et al. 2009; McEvoy et al. 2009, 2010; Nassir et al. 2009; Henn et al. 2011; Kidd et al. 2011). Genotype data for thousands of samples from over 300 worldwide populations were obtained from 15 public and private collections (Conrad et al. 2006; Reich et al. 2009; Silva-Zolezzi et al. 2009; Teo et al. 2009; Xing et al. 2009, 2010; Altshuler et al. 2010; Behar et al. 2010; Hunter-Zinck et al. 2010; Rasmussen et al. 2010, 2011; Chaubey et al. 2011; Hatin et al. 2011; Henn et al. 2011; Yunusbayev et al. 2012) and the FamilyTreeDNA collection. To study gene flow from apes, ancient hominins, and modern humans, we used the data set of 257,000 high-quality autosomal SNPs assembled by Reich et al. (2010).

SNP Validation

To cross-validate the GenoChip’s autosomal genotypes, we genotyped 168 samples from 14 worldwide populations of the 1000 Genomes Project including Americans of African ancestry from Southwest United States (ASW), Americans of Mexican ancestry from Los Angeles, CA (MEX), Utah residents with Northern and Western European ancestry from UT (CEU), England and Scotland British (GBR), Finnish from Finland (FIN), Gujarati Indians from Houston, TX (GIH), Han Chinese from Beijing, China (CHB), Iberians from Spain (IBS), Italians from Tuscany, Italy (TSI), Japanese from Tokyo, Japan (JPT), Kinh from Ho Chi Minh City, Vietnam (KHV), Luhya in Webuye, Kenya (LWK), Peruvians from Lima, Peru (PEL), and Yoruba in Ibadan, Nigeria (YRI). The concordance rate between GenoChip and the 1000 Genomes Project genotypes was calculated as the proportion of genotypes that were identical between the two data sets.
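The concordance rate defined above can be illustrated with a minimal sketch. The function name, the dictionary layout, and the toy genotypes are hypothetical and are not taken from the GenoChip pipeline; the sketch also assumes that unphased genotypes are compared irrespective of allele order and that sites missing from either data set are skipped.

```python
def concordance_rate(calls_a, calls_b, missing=None):
    """Proportion of identical genotype calls among sites called in both sets.

    calls_a, calls_b: dicts mapping a SNP identifier to a genotype string
    such as "AG" (hypothetical layout, chosen for illustration only).
    """
    compared = 0
    identical = 0
    for site, geno_a in calls_a.items():
        geno_b = calls_b.get(site, missing)
        if geno_a == missing or geno_b == missing:
            continue  # skip sites that are not called in both data sets
        compared += 1
        # Sort alleles so that "AG" and "GA" count as the same genotype.
        if sorted(geno_a) == sorted(geno_b):
            identical += 1
    if compared == 0:
        raise ValueError("no overlapping called genotypes")
    return identical / compared


# Toy example: three comparable sites, two of which agree.
chip = {"rs1": "AG", "rs2": "CC", "rs3": "TT", "rs4": None}
reference = {"rs1": "GA", "rs2": "CC", "rs3": "CT", "rs4": "AA"}
rate = concordance_rate(chip, reference)
```

In this toy case rs4 is ignored because one call is missing, so the rate is computed over three sites.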

Comparing Population Genetic Summary Statistics between Genotyping Arrays

To compare the performances of the validated approximately 130,000 autosomal and X-chromosomal SNPs of the GenoChip array to commercial arrays, we obtained the list of SNPs for the Illumina Human660W-Quad BeadChip (544,366 SNPs) from Illumina and the Affymetrix Axiom Human Origins array (627,719 SNPs) available at ftp://ftp.cephb.fr/hgdp_supp10/Harvard_HGDP-CEPH/all_snp.map.gz (last accessed May 19, 2013). Because of the lack of overlap between these genotyping arrays, we used subsets of data calculated for HapMap III populations. Minor allele frequency (MAF) and FST estimates for Africans, Europeans, and Asians were obtained from the “continental” HapMap data set, as described in Elhaik (2012). Briefly, genotype data of 602 unrelated individuals from eight populations (YRI, LWK, Maasai in Kinyawa, Kenya [MKK], CEU, TSI, CHB, Chinese from metropolitan Denver, Colorado [CHD], and JPT) were downloaded from the International HapMap Project web site (phase 3, second draft) (Altshuler et al. 2010), passed through rigorous filtering criteria, and finally merged into continental populations (African [288], European [144], and Asian [170]). The final continental data set consisted of 3 million SNPs genotyped in at least one population from each continent. The MAF and FST values of the continental data set for autosomal (2,823,367) and X-chromosomal (86,449) SNPs were compared with those obtained from GenoChip (126,425 and 2,421 SNPs, respectively), Illumina Human660W (541,104 and 12,916 SNPs, respectively), and Affymetrix Axiom Human Origins Array (308,949 and 2,984 SNPs, respectively). Because of the large number of FST values in each data set, their length distributions are very noisy. We thus adopted a simple smoothing approach in which FST values are sorted and divided into 1,000 equally sized subsets. The distribution of the mean FST value is then calculated using a histogram with 40 equally sized bins ranging from 0 to 1.
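The smoothing step described above (sort, split into equally sized subsets, take subset means, then bin the means) can be sketched as follows. This is an illustrative reading of the description, not the authors' code; the function name is hypothetical, and the way leftover values are distributed when the number of values is not a multiple of the number of subsets is an assumption.

```python
def smoothed_fst_histogram(fst_values, n_subsets=1000, n_bins=40):
    """Sort FST values, average them in equally sized subsets, and bin the
    subset means into n_bins equal-width bins on the interval [0, 1]."""
    ordered = sorted(fst_values)
    n = len(ordered)
    if n < n_subsets:
        raise ValueError("need at least one value per subset")
    means = []
    for k in range(n_subsets):
        # Integer arithmetic keeps subset sizes within one value of each other.
        start = k * n // n_subsets
        stop = (k + 1) * n // n_subsets
        chunk = ordered[start:stop]
        means.append(sum(chunk) / len(chunk))
    counts = [0] * n_bins
    for mean in means:
        # A mean of exactly 1.0 is placed in the last bin.
        index = min(int(mean * n_bins), n_bins - 1)
        counts[index] += 1
    return means, counts


# Small demonstration with 10 subsets and 4 bins instead of 1,000 and 40.
toy_fst = [i / 1000 for i in range(1000)]
toy_means, toy_counts = smoothed_fst_histogram(toy_fst, n_subsets=10, n_bins=4)
```

With the defaults (1,000 subsets, 40 bins) the function follows the parameters stated in the text.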
To test whether two such FST distributions obtained by different arrays are different, we used the Kolmogorov–Smirnov goodness-of-fit test and the false discovery rate correction for multiple tests (Benjamini and Hochberg 1995). Because the differences between the distributions were highly significant due to the large sample sizes, we also calculated the effect size, first by using the nonoverlapping percentage of the two distributions, and then by using Hedges' g estimator of Cohen’s d (Hedges 1981). If the area overlap is larger than 98% and Cohen’s d is smaller than 0.05, we consider the magnitude of the difference between the two distributions to be too small to be biologically meaningful. Principal components analysis (PCA) calculations were carried out using smartpca of the EIGENSOFT package (Patterson et al. 2006). Polygons were drawn manually around populations clustered separately from one another.
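The statistical comparison described above can be sketched with the Python standard library. Everything below is an illustrative assumption rather than the authors' implementation: the function names are hypothetical, the Kolmogorov–Smirnov P value uses the standard large-sample approximation, the area overlap is taken as the sum of bin-wise minima of two normalized histograms, and Hedges' g is computed as Cohen's d multiplied by the usual small-sample correction.

```python
import math


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic (maximum ECDF distance)."""
    xs, ys = sorted(x), sorted(y)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        value = min(xs[i], ys[j])
        while i < len(xs) and xs[i] <= value:
            i += 1
        while j < len(ys) and ys[j] <= value:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d


def ks_pvalue(d, n, m, terms=100):
    """Large-sample approximation to the two-sample KS P value."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the alternating series is unreliable here; P is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return max(0.0, min(1.0, 2.0 * total))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def histogram_overlap(counts_a, counts_b):
    """Shared area of two histograms after normalizing each to sum to 1."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    return sum(min(a / total_a, b / total_b)
               for a, b in zip(counts_a, counts_b))


def hedges_g(x, y):
    """Cohen's d with the pooled SD, times the Hedges correction factor."""
    n1, n2 = len(x), len(y)
    mean1, mean2 = sum(x) / n1, sum(y) / n2
    var1 = sum((v - mean1) ** 2 for v in x) / (n1 - 1)
    var2 = sum((v - mean2) ** 2 for v in y) / (n2 - 1)
    pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    correction = 1.0 - 3.0 / (4.0 * (n1 + n2) - 9.0)
    return correction * (mean1 - mean2) / pooled
```

Under the criterion stated in the text, a difference would be treated as negligible when `histogram_overlap(...)` exceeds 0.98 and the absolute value of `hedges_g(...)` is below 0.05.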

Results and Discussion

Designing the GenoChip

Choosing the Markers

The GenoChip was designed as an Illumina iSelect HD custom genotyping bead array that offers the ability to interrogate almost any SNP. In designing the chip, we endeavored to identify the fewest possible SNPs that offer an increased power for ancestry inference in comparison to random markers (Royal et al. 2010). SNPs that discern and identify populations are termed AIMs and are considered invaluable tools in population genetics and genetic anthropology. Half of our AIMs were culled from the literature, and the remainder were calculated using our novel AIMsFinder, based on an approach described by Elhaik (2013), and infocalc (Rosenberg 2005) (supplementary text S1, Supplementary Material online). These two methods were applied to global panels comprising over 300 populations (supplementary table S1, Supplementary Material online) assembled from public and private data sets that were genotyped on a diversified set of arrays ranging from 30,000 to more than a million SNPs in size. Many of these populations are unique to our project and have never before been studied or searched for AIMs. Because AIMsFinder infers the minimal number of markers necessary to discern two genetically distinct populations, it was applied in a pairwise fashion over all the population data sets. In contrast, infocalc, which ranks SNPs by their informativeness for ancestry, was applied to whole population panels organized by the source of the genotype data (supplementary table S1, Supplementary Material online), where the top 1% of the results were considered AIMs. Overall, we ascertained over 80,000 autosomal and X-chromosomal AIMs from over 450 worldwide populations (fig. 1).
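Ranking SNPs by informativeness and keeping the top 1% can be illustrated with a small sketch. It assumes that the ranking statistic is the informativeness for assignment (I_n) of Rosenberg and colleagues, computed here for biallelic SNPs from per-population allele frequencies; whether infocalc was run with exactly this statistic and these inputs is an assumption, and the function names and toy frequencies are hypothetical.

```python
import math


def informativeness(freqs):
    """Informativeness for assignment (I_n) of one biallelic SNP.

    freqs: frequency of one allele in each of K populations. The statistic
    is sum over alleles j of: -p_j * ln(p_j) + sum_i (p_ij / K) * ln(p_ij),
    where p_j is the mean frequency of allele j across populations.
    """
    k = len(freqs)
    total = 0.0
    for allele_freqs in (freqs, [1.0 - f for f in freqs]):
        mean = sum(allele_freqs) / k
        if mean > 0:
            total -= mean * math.log(mean)
        for f in allele_freqs:
            if f > 0:  # 0 * ln(0) is taken as 0
                total += (f / k) * math.log(f)
    return total


def top_fraction(snp_freqs, fraction=0.01):
    """Return the SNP identifiers in the top `fraction` by informativeness."""
    ranked = sorted(snp_freqs,
                    key=lambda snp: informativeness(snp_freqs[snp]),
                    reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]


# Toy panel of three SNPs typed in two populations.
panel = {"fixed_difference": [1.0, 0.0],
         "no_difference": [0.5, 0.5],
         "moderate_difference": [0.9, 0.1]}
best = top_fraction(panel, fraction=0.34)
```

A SNP fixed for different alleles in two populations reaches the maximum of ln 2 for two populations, whereas a SNP with identical frequencies everywhere scores 0.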

Worldwide distribution of populations from which AIMs were obtained. AIMs from over 450 world populations were harvested from the literature (green) and calculated based on genotyped data from public and private collections (red) including over 30 Jewish populations (blue).

To facilitate studies on the extent of gene flow from Neanderthal and Denisovan to modern humans, we collected from the literature SNPs and haplotypes from genomic regions bearing evidence of interbreeding (Noonan et al. 2006; Green et al. 2010; Yotova et al. 2011). In addition, we used a modified version of IsoPlotter+ (Elhaik et al. 2010; Elhaik and Graur 2013) to identify regions in which modern humans and Neanderthals share the derived allele and chimpanzees and Denisovans share the ancestral allele (supplementary text S1, Supplementary Material online). Using the same approach, we identified SNPs within regions enriched for the Denisovan shared derived alleles with humans. Overall, we included nearly 26,000 autosomal and X-chromosomal SNPs from potential interbreeding hotspots with extinct hominins. To support studies of more recent gene flow from ancient to modern humans, we included approximately 10,400 high-confidence Paleo-Eskimo Saqqaq SNPs (Rasmussen et al. 2010). In addition, we included approximately 12,000 high-confidence Aboriginal SNPs (Rasmussen et al. 2011). High-linkage disequilibrium (LD) SNPs (r² > 0.4) were excluded in all populations, by choosing a random SNP of the high-LD pair, except for hunter gatherers such as the Hadza and Sandawe of Tanzania (Tishkoff and Williams 2002) and Melanesian populations (Conrad et al. 2006) that are used to infer interbreeding with extinct hominins (Reich et al. 2010; Lachance et al. 2012).
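The exclusion of high-LD SNPs can be sketched as follows. The sketch assumes that r² is estimated as the squared Pearson correlation of genotype dosages (0, 1, or 2 copies of an allele) and that, for each pair above the threshold, one SNP chosen at random is dropped; the text does not specify the estimator, so both choices are assumptions, and the function names are hypothetical.

```python
import random


def r_squared(dosages_a, dosages_b):
    """Squared Pearson correlation between two vectors of genotype dosages."""
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b)
              for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_a * var_b)


def prune_high_ld(genotypes, threshold=0.4, seed=0):
    """Drop one randomly chosen SNP from every pair with r2 above threshold.

    genotypes: dict mapping a SNP identifier to its list of dosages.
    """
    rng = random.Random(seed)
    names = list(genotypes)
    removed = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if first in removed or second in removed:
                continue
            if r_squared(genotypes[first], genotypes[second]) > threshold:
                removed.add(rng.choice([first, second]))
    return [name for name in names if name not in removed]


toy_genotypes = {"snp1": [0, 1, 2, 0, 1, 2],
                 "snp2": [0, 1, 2, 0, 1, 2],   # perfectly correlated with snp1
                 "snp3": [0, 2, 1, 2, 0, 1]}   # uncorrelated with snp1
kept = prune_high_ld(toy_genotypes)
```

A production pipeline would normally restrict the comparison to nearby SNPs on the same chromosome rather than testing all pairs.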
To support potential imputation efforts, we supplemented regions of low SNP density (<1 SNP over 100,000 bases) with random common SNPs from HapMap III (1,000 SNPs with MAF > 20%) and the 1000 Genomes Project (3,500 SNPs with MAF > 10% in at least one continental population). To prevent false positives, we included mostly SNPs observed in both the HapMap III and 1000 Genomes Project data sets (Altshuler et al. 2010; Durbin et al. 2010). We further eliminated A/T and C/G SNPs to minimize strand misidentification. The resulting chip has a SNP density of at least 1/100 kilobases over 92% of the assembled human genome (hg19) (fig. 2), including regions uncharted by the HapMap (I-III) and HGDP projects (Conrad et al. 2006; Altshuler et al. 2010). This high density of the chip and the excess inclusion of AIMs make it suitable for imputation, particularly for common markers (Pasaniuc et al. 2012).
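Two of the filters described above lend themselves to a short illustration: removing strand-ambiguous A/T and C/G SNPs, and locating 100,000-base windows that contain no SNP. The sketch assumes 1-based positions on a single chromosome and non-overlapping windows; the function names are hypothetical.

```python
# A/T and C/G SNPs read the same on both strands, so the strand cannot be
# inferred from the alleles alone.
AMBIGUOUS_PAIRS = {frozenset("AT"), frozenset("CG")}


def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, regardless of allele order or case."""
    return frozenset((allele1.upper(), allele2.upper())) in AMBIGUOUS_PAIRS


def low_density_windows(positions, chrom_length, window=100_000):
    """Indices of non-overlapping windows that contain fewer than one SNP.

    positions: 1-based SNP coordinates on one chromosome.
    """
    n_windows = -(-chrom_length // window)  # ceiling division
    counts = [0] * n_windows
    for position in positions:
        counts[(position - 1) // window] += 1
    return [index for index, count in enumerate(counts) if count < 1]


snps = [("rs1", "A", "T"), ("rs2", "A", "G"), ("rs3", "C", "G")]
unambiguous = [name for name, a1, a2 in snps if not is_strand_ambiguous(a1, a2)]
gaps = low_density_windows([50_000, 250_000], chrom_length=300_000)
```

In the toy data only rs2 survives the strand filter, and the middle window is reported as a gap to be supplemented.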

SNP density in the GenoChip. The average numbers of GenoChip SNPs per 100,000 nucleotides across the genome are color coded. Gaps in the assembly are shown in gray.

Finally, we constructed over 45,000 probes to identify SNPs defining all known Y-chromosome and mtDNA haplogroups, many of which were not reported in the literature (supplementary text S2, Supplementary Material online).

Compatibility to Commercial Genotyping Arrays

Looking at autosomal and X-chromosomal SNPs, the GenoChip is highly compatible with other commercial arrays. Some 76% of our SNPs overlap with those in the Illumina Human 660W-Quad array, 55% overlap with the Illumina HumanOmni1-Quad, Illumina Express, and Affymetrix 6.0 arrays, and 40% overlap with the Affymetrix 5.0 and Affymetrix Human Origins arrays. With the exception of dedicated Y chromosome and mtDNA chips, the GenoChip includes the most comprehensive collection of uniparental markers.

Vetting the Chip for Health, Trait, or Medical Markers

Several steps were taken to ensure that the genetic results would not be exploited for pharmaceutical, medical, and biotechnological purposes. First, participant samples were maintained in complete anonymity during GenoChip analysis. Second, no phenotypic or medical data were collected from the participants. Third, we included only SNPs in noncoding regions without any known functional association (Graur et al. 2013), as reported in dbSNP build 132. Last, we filtered our SNP collection against a 1.5 million SNP data set (Pheno SNPs) containing all variants that have potential, known, or suspected associations with diseases. To construct the Pheno SNPs data set, we extracted SNPs from multiple open-access databases including the Online Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/omim/, last accessed May 19, 2013), the Cancer Genome Atlas (Hudson et al. 2010), PhenCode (Giardine et al. 2007), the National Human Genome Research Institute (NHGRI) Genome-Wide Association Studies (GWAS) Catalog (Hindorff et al. 2009), The Genetic Association Database (Becker et al. 2004), MutaGeneSys (Stoyanovich and Pe'er 2008), GWAS Central (Thorisson et al. 2009), and SNPedia, as well as SNPs identified in the major histocompatibility complex (MHC) region. We also excluded SNPs reported to be associated with phenotypic traits. Finally, to circumvent imputation efforts toward inferring potential medically relevant SNPs, we excluded SNPs that were in high LD (r² > 0.8) with the Pheno SNPs. We thus designed the first genotyping array dedicated to genetic anthropological and genealogical research that is suitable for detecting gene flow from archaic hominins and ancient humans into modern humans as well as between worldwide populations.
The final GenoChip has over 130,000 highly informative autosomal and X-chromosomal markers, approximately 12,000 Y-chromosomal markers, and approximately 3,300 mtDNA markers without any known health, medical, or phenotypic relevance (supplementary table S2, Supplementary Material online).

Validating the GenoChip Results

The accuracy of the autosomal genotypes obtained by the GenoChip was assessed by genotyping 168 worldwide samples from the 1000 Genomes Project and cross-validating the results. The concordance rate per sample was over 99.5%. We did not observe any position with mismatching homozygote alleles. The marginal error rate was expected due to the low coverage of the 1000 Genomes Project data, particularly for rare alleles (Durbin et al. 2010). We thus confirmed that genotypes reported by the GenoChip are accurate. The ability of the GenoChip to infer uniparental haplogroups was similarly assessed by genotyping 400 additional samples with known haplogroups. The haplotypes of these samples were confirmed by Sanger sequencing of the full mitochondrial genome and all relevant Y chromosome SNP locations that determined the exact haplogroup down to the last branch of the published Y-chromosomal tree (supplementary text S2, Supplementary Material online). The average success rates for the paternal and maternal haplogroups were 82% and 90%, respectively (fig. 3). We could not validate the remaining haplogroups because control samples needed to identify deeper splits in the tree were unavailable. Moreover, some haplogroups cannot be measured with the Illumina bead chip technology because they are not represented by a real SNP but rather by large-scale variations of repetitive elements. We note that some of the failed markers for particular haplogroups can be substituted by phylogenetically equivalent markers, thereby rescuing these haplogroups, although formally they were counted as missing. Our experience with the tens of thousands of GenoChip participants indicates that most samples (>99%) are classified on haplogroup branches that are perfectly captured by the GenoChip. The remaining users, for whom the exact position along the tree cannot be assigned (e.g., R-P312*), are classified to a higher level haplogroup (e.g., R-P310).
A large-scale genotyping effort to validate the remaining haplogroups is underway. We thus confirmed that GenoChip produces highly accurate results and has broad coverage for markers defining Y-chromosome and mtDNA haplogroups.

Success rate in identifying Y-chromosomal (left) and mtDNA (right) haplogroups. The plots depict all known basal haplogroups (columns), the number of known subgroups in each haplogroup (top of each column), and the proportion of subgroups that were validated with the GenoChip.

Testing the GenoChip’s Abilities to Discern Populations

MAF Distribution

Before comparing the ability of the GenoChip SNPs to discern populations, we compared the similarity of their MAF distribution with those of the Illumina Human660W and Affymetrix Human Origins SNP arrays. Because of the low overlap of these three arrays, we obtained and analyzed genotype data from eight HapMap populations. The results of the complete set of HapMap markers were compared with three subsets of markers that overlapped with those of each array. A comparison of the MAF distributions of the three arrays revealed gross differences in allele frequencies (fig. 4, supplementary fig. S1, Supplementary Material online). In the HapMap data set, over 82% of the SNPs are common (MAF > 0.05) and less than 5% are considered rare (MAF < 0.01). The proportion of common SNPs in all the arrays is similar (96–98%), but the GenoChip is enriched for the most common SNPs (MAF > 0.25). Because of the high frequency of the rare ENCODE SNPs in the HapMap data set, none of the arrays resembled the shape of the HapMap’s MAF distribution. Nonetheless, both the Human660W (0.07%) and Human Origins (0.36%) arrays are enriched in rare SNPs compared with the GenoChip (0.008%). Similar trends were observed for X-chromosomal SNPs. Here, the HapMap data set consisted of 83% common SNPs, compared with 93% for the GenoChip and 96% for the commercial arrays. The GenoChip array exhibits similar enrichment in the most common SNPs (MAF > 0.3), but unlike the commercial arrays, it also consists of 1% extremely rare SNPs due to the inclusion of rare haplotypes speculated to indicate interbreeding with archaic hominins. Altogether, the MAF distributions of the three arrays differ from the HapMap MAF distribution and correspond to the choices of SNP ascertainment made in the design of each array.
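The frequency classes used above (common for MAF > 0.05, rare for MAF < 0.01) can be computed from allele counts with a minimal sketch. The function names and toy values are hypothetical, and the label "intermediate" for frequencies between the two thresholds is introduced here only for illustration.

```python
def minor_allele_frequency(allele_count, total_alleles):
    """Frequency of the less common allele at a biallelic SNP."""
    frequency = allele_count / total_alleles
    return min(frequency, 1.0 - frequency)


def maf_class(maf, common=0.05, rare=0.01):
    """Classify a SNP using the thresholds quoted in the text."""
    if maf > common:
        return "common"
    if maf < rare:
        return "rare"
    return "intermediate"  # label assumed for illustration


def class_proportions(mafs):
    """Proportion of SNPs falling in each frequency class."""
    counts = {"common": 0, "intermediate": 0, "rare": 0}
    for maf in mafs:
        counts[maf_class(maf)] += 1
    return {label: count / len(mafs) for label, count in counts.items()}


# 150 copies of one allele among 200 chromosomes gives a MAF of 0.25.
example_maf = minor_allele_frequency(150, 200)
proportions = class_proportions([0.3, 0.2, 0.03, 0.005])
```

Applying `class_proportions` to the SNPs of each array would give percentages comparable to those reported above.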

MAF distributions for autosomal (a) and X-chromosomal (b) HapMap SNPs. MAF distributions are shown for HapMap SNPs and two subsets that overlap with the Illumina Human660W and GenoChip SNPs.

Genomewide FST Distribution

To assess the extent of genetic diversity that can be inferred among human subpopulations by the different arrays, we next compared their FST distributions (Wright 1951). FST measures the differentiation of a subpopulation relative to the total population and is directly related to the variance in allele frequency between subpopulations, such that a high FST corresponds to a larger difference between subpopulations (Holsinger and Weir 2009). Elhaik (2012) used 1 million markers that were genotyped in 602 HapMap samples from eight populations to carry out a two-level hierarchical FST analysis. He showed that the greatest proportion of genetic variation occurred within individuals residing in the same populations, with only a small amount (12%) of the total genetic variation being distributed between continental populations and an even smaller amount (1%) between intracontinental populations. An FST distribution for three continental populations employing 3 million HapMap SNPs yielded an even lower estimate (8%) of the proportion of genetic variation distributed between continental populations due to the large number of rare alleles (Elhaik 2012). In a manner similar to the later analysis of Elhaik (2012), we used the FST values calculated for eight HapMap populations grouped into three continental populations to create three subsets for the markers that overlap with each array. Although all FST distributions were similar in shape to the HapMap FST distribution, they differed in their means (fig. 5, supplementary fig. S2, Supplementary Material online). The autosomal and X-chromosomal SNPs of the commercial arrays have significantly lower FST values (Kolmogorov–Smirnov goodness-of-fit test, P < 0.05) than those of the GenoChip due to the high fraction of rare uninformative SNPs in these arrays.
The magnitude of the differences between the FST values of the GenoChip and those of the commercial arrays was also large for autosomal (area overlap 86–91%, Cohen's d 0.09–0.13) and X-chromosomal SNPs (area overlap 93%, Cohen's d 0.09–0.11). These results suggest a reduced ability of the commercial arrays to elucidate ancient demographic processes (Kimura and Ota 1973; Watterson and Guess 1977).
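The relationship between FST and the variance in allele frequency can be made concrete with a simple sketch. It uses the classical variance-based definition for a biallelic locus, FST = Var(p) / (p̄(1 − p̄)), with equal weight for each subpopulation; this is not the hierarchical estimator of Elhaik (2012) used for the analyses above, and the function name is hypothetical.

```python
def wright_fst(subpop_freqs):
    """Variance-based FST for one biallelic locus.

    subpop_freqs: frequency of the same allele in each subpopulation,
    with all subpopulations weighted equally (a simplifying assumption).
    """
    k = len(subpop_freqs)
    mean = sum(subpop_freqs) / k
    if mean <= 0.0 or mean >= 1.0:
        return 0.0  # locus is monomorphic overall, no differentiation
    variance = sum((p - mean) ** 2 for p in subpop_freqs) / k
    return variance / (mean * (1.0 - mean))


fixed = wright_fst([0.0, 1.0])      # alternative alleles fixed: FST = 1
identical = wright_fst([0.5, 0.5])  # no frequency difference: FST = 0
moderate = wright_fst([0.2, 0.8])   # 0.09 / 0.25 = 0.36
```

Rare SNPs tend to have frequencies near zero in every population, which leaves little variance between subpopulations and therefore low FST values, consistent with the interpretation given above.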

Distribution of locus-specific FST in three continental populations. FST values were obtained for (a) autosomal and (b) X-chromosomal HapMap SNPs. FST distributions are shown for HapMap SNPs and two subsets that overlap with the Illumina Human660W and GenoChip SNPs. The histograms show bin distribution as indicated on the x axis and the cumulative distribution (line).

The Illumina Human660W array had the highest fraction of low-FST alleles, suggesting it is the least suitable for population genetic studies compared with the GenoChip and Human Origins. As only half of the Human Origins SNPs could be tested, it is difficult to evaluate its performance. However, we speculate that the large number of rare alleles reflects the private alleles of the dozen populations used for its ascertainment. Because the MAF and FST were not used as filtering criteria for the GenoChip SNPs, we can conclude that its enrichment toward high-FST SNPs mirrors the success of the ascertainment process and its potential for population genetic studies.

Genetic Diversity in Worldwide Populations

Last, PCA (Price et al. 2006) was used to explore the extent of population differentiation between 14 worldwide populations that were genotyped on the GenoChip in the validation stage (fig. 6A). The samples aligned along the two well-established geographic axes of global genetic variation: PC1 (sub-Saharan Africa vs. the rest of the Old World) and PC2 (east vs. west Eurasia) (e.g., Li et al. 2008; Elhaik 2013). GenoChip results reveal geographically refined groupings of Eastern (Luhya) and Western (Yoruba) Africans, Eastern (Chinese and Japanese) and South Eastern (Vietnamese) Asians, Amerindian (Peruvians and Mexicans) and Indian populations, and finally Northern (Finnish), Southern (Italian and Iberians), and Western (British and CEU) Europeans. As expected, the Amerindian populations form a gradient along the diagonal line between Europeans and East Asians based on their dominant ancestry, as did the African Americans along the diagonal line between Africans and Europeans. These patterns are similar to those observed in worldwide populations using commercial arrays (e.g., Teo et al. 2009; Xing et al. 2010).
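The idea behind a genotype PCA can be shown on a toy matrix with the standard library alone. The analyses above were run with smartpca from EIGENSOFT; the sketch below is not that program. It assumes a common normalization (center each SNP and scale by the square root of p(1 − p), where p is half the mean dosage) and uses power iteration to approximate the first principal component. The function names and toy genotypes are hypothetical.

```python
import math


def normalize_genotypes(genotypes):
    """Center and scale each SNP column; monomorphic SNPs are dropped.

    genotypes: list of samples, each a list of 0/1/2 dosages per SNP.
    """
    n_samples = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        column = [row[j] for row in genotypes]
        mean = sum(column) / n_samples
        p = mean / 2.0
        if p <= 0.0 or p >= 1.0:
            continue
        scale = math.sqrt(p * (1.0 - p))
        columns.append([(g - mean) / scale for g in column])
    if not columns:
        raise ValueError("no polymorphic SNPs")
    return columns


def first_principal_component(genotypes, iterations=200):
    """Approximate the top eigenvector of the sample covariance matrix."""
    columns = normalize_genotypes(genotypes)
    n = len(genotypes)
    cov = [[sum(c[a] * c[b] for c in columns) / len(columns)
            for b in range(n)] for a in range(n)]
    # A non-constant start vector is needed: a constant vector lies in the
    # null space of the covariance of centered data.
    vec = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        product = [sum(cov[a][b] * vec[b] for b in range(n)) for a in range(n)]
        norm = math.sqrt(sum(v * v for v in product))
        vec = [v / norm for v in product]
    return vec


# Two toy groups of three samples with opposite allele frequencies.
toy = [[0, 0, 2, 2], [0, 1, 2, 2], [0, 0, 2, 1],
       [2, 2, 0, 0], [2, 1, 0, 0], [2, 2, 0, 1]]
pc1 = first_principal_component(toy)
```

In this toy example the coordinates on the first component take one sign for the first three samples and the opposite sign for the last three, which is the kind of separation the polygons in the figure highlight.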

PCA plots of genetic diversity across 14 worldwide populations. Each figure represents the genetic diversity seen across the populations considered, with each sample mapped onto a spectrum of genetic variation represented by two axes of variations corresponding to two eigenvectors of the PCA. Individuals from each population are represented by a unique color. (A) Analysis of all populations. The insets magnify European, Asian, and the cluster of Amerindian and Indian individuals. (B) Analysis of East Asian individuals. (C) Analysis of European individuals. (D) Analysis of Amerindian and Indian individuals. A polygon surrounding all or most of the individuals belonging to a group designation highlights the population groups.

When we consider only the East Asian populations (comprising CHB, JPT, and KHV), the first and second axes of variation completely separated the three populations (fig. 6B), in agreement with Teo et al. (2009). In a similar manner, we were able to differentiate Gujarati Indians and Americans of Mexican ancestry (fig. 6C), as well as Italians, Iberians, and Western European populations (fig. 6D), with the exception of one TSI outlier. As expected, some overlap was observed between individuals of Northern and Western European ancestry (CEU) and British (GBR).

Conclusions

To summarize, we designed, developed, validated, and tested the GenoChip, the first genotyping chip completely dedicated to genetic anthropology. The GenoChip will help to clarify the genetic relationships between archaic hominins such as Neanderthal and Denisovan, extinct humans, and modern humans as well as to provide a more detailed understanding of human migratory history. We compared the MAF and FST distributions of the GenoChip SNPs to those of HapMap and two commercially available arrays and demonstrated the ability of the GenoChip to differentiate subpopulations within global data sets. We expect that the expanded use of the GenoChip in genetic anthropology research will expand our knowledge of the history of our species.

Supplementary Material

Supplementary text S1 and S2, tables S1 and S2, and figures S1–S4 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

67 in total

1. Algorithms for selecting informative marker panels for population assignment.

Authors: Noah A Rosenberg
Journal: J Comput Biol Date: 2005-11 Impact factor: 1.479

2. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

3. Analysis of genomic diversity in Mexican Mestizo populations to develop genomic medicine in Mexico.

Authors: Irma Silva-Zolezzi; Alfredo Hidalgo-Miranda; Jesus Estrada-Gil; Juan Carlos Fernandez-Lopez; Laura Uribe-Figueroa; Alejandra Contreras; Eros Balam-Ortiz; Laura del Bosque-Plata; David Velazquez-Fernandez; Cesar Lara; Rodrigo Goya; Enrique Hernandez-Lemus; Carlos Davila; Eduardo Barrientos; Santiago March; Gerardo Jimenez-Sanchez
Journal: Proc Natl Acad Sci U S A Date: 2009-05-11 Impact factor: 11.205

4. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits.

Authors: Lucia A Hindorff; Praveen Sethupathy; Heather A Junkins; Erin M Ramos; Jayashri P Mehta; Francis S Collins; Teri A Manolio
Journal: Proc Natl Acad Sci U S A Date: 2009-05-27 Impact factor: 11.205

5. European and Polynesian admixture in the Norfolk Island population.

Authors: B P McEvoy; Z Z Zhao; S Macgregor; C Bellis; R A Lea; H Cox; G W Montgomery; L R Griffiths; P M Visscher
Journal: Heredity (Edinb) Date: 2009-12-09 Impact factor: 3.821

6. Mitochondrial DNA and human evolution.

Authors: R L Cann; M Stoneking; A C Wilson
Journal: Nature Date: 1987 Jan 1-7 Impact factor: 49.962

7. Singapore Genome Variation Project: a haplotype map of three Southeast Asian populations.

Authors: Yik-Ying Teo; Xueling Sim; Rick T H Ong; Adrian K S Tan; Jieming Chen; Erwin Tantoso; Kerrin S Small; Chee-Seng Ku; Edmund J D Lee; Mark Seielstad; Kee-Seng Chia
Journal: Genome Res Date: 2009-08-21 Impact factor: 9.043

Review 8. Genetics in geographically structured populations: defining, estimating and interpreting F(ST).

Authors: Kent E Holsinger; Bruce S Weir
Journal: Nat Rev Genet Date: 2009-09 Impact factor: 53.242

9. Genetic and fossil evidence for the origin of modern humans.

Authors: C B Stringer; P Andrews
Journal: Science Date: 1988-03-11 Impact factor: 47.728

10. Population structure and eigenanalysis.

Authors: Nick Patterson; Alkes L Price; David Reich
Journal: PLoS Genet Date: 2006-12 Impact factor: 5.917
