Analyses of a set of 128 ancestry informative single-nucleotide polymorphisms in a global set of 119 population samples.

Judith R Kidd, Françoise R Friedlaender, William C Speed, Andrew J Pakstis, Francisco M De La Vega, Kenneth K Kidd

Abstract

BACKGROUND: Using DNA to determine an individual's ancestry from among human populations is generally interesting and useful for many purposes, including admixture mapping, controlling for population structure in disease or trait association studies and forensic ancestry inference. However, to estimate ancestry, including possible admixture within an individual, as well as heterogeneity within a group of individuals, allele frequencies are necessary for what are believed to be the contributing populations. For this purpose, panels of ancestry informative markers (AIMs) have been developed.
RESULTS: We are presenting our work on one such panel, composed of 128 ancestry informative single-nucleotide polymorphisms (AISNPs) already proposed in the literature. Compared to previous studies of these AISNPs, we have studied three times the number of individuals (4,871) in three times as many population samples (119). We have validated this panel for many ancestry assignment and admixture studies, especially those that were the rationale for the original selection of the 128 SNPs: African Americans and Mexican Americans. At the same time, the limitations of the panel for distinguishing ancestry and quantifying admixture among Eurasian populations are noted.
CONCLUSION: We demonstrate the simultaneous importance of the specific set of population samples and their relative sample sizes in the use of the structure program to determine which groups cluster together and consequently influence the ability of a marker panel to infer ancestry. We demonstrate the strengths and weaknesses of this particular panel of AISNPs in a global context.

Year:  2011        PMID: 21208434      PMCID: PMC3025953          DOI: 10.1186/2041-2223-2-1

Source DB:  PubMed          Journal:  Investig Genet        ISSN: 2041-2223


Background

In recent years, there have been many proposed ancestry informative markers (AIMs) and published sets of AIMs useful for particular purposes. Some sets have focused on estimating the admixture between specific ancestral populations such as the African and European genetic contributions to African Americans or European, Native American and African contributions to Latino populations [e.g., 1-7]. Others have focused on distinguishing ancestral origins from three or four continental regions, such as sub-Saharan Africa, Europe, East Asia and the Americas [8-12], or more broadly between many globally distributed populations [13-16]. Yet others have focused on identifying the stratification of populations within particular geographic areas [e.g., 17-19] or within a clinical association study sample [20-22]. Whatever the purpose, the general usefulness of such AIMs depends very much on the set of populations used to identify and characterize them. Some global studies have used only a few but widely separated population samples [e.g., 15]. Others have used the HGDP-CEPH panel [23] of about 1,000 samples from 52 populations to select AIMs [10,11]. All approaches provide useful data but may also have weaknesses due to sampling error, either because the population samples used may not be highly representative of a broader geographic area or because the individual sample sizes are very small and subject to very large sampling errors. The same criticism applies to studies attempting to identify markers that provide ancestry information within a region, such as East Asia [24] or Europe [17,18]. Whatever the strategy for identifying them, AIMs are necessarily selected because they distinguish the specific population samples used. Therefore, replication with other samples of individuals from the same and/or closely related populations is necessary to verify the robustness of any set of AIMs. Such replication is onerous, costly and rarely undertaken.
Given the broad interest in AIMs in genetics, medicine, anthropology and forensics, the development of an optimal set of AIMs for a broad range of uses needs to be based on multiple markers studied on moderate to large samples of multiple relevant populations; appropriate resources will probably not be available in any single lab. As we advocated in the case of single-nucleotide polymorphisms (SNPs) for individual identification [25,26], multiple labs need to test candidate markers on additional populations and for general robustness in the laboratory. While very large numbers of markers can provide quite accurate ancestry information for multiple geographic regions, small but robust sets of markers are especially useful. Seldin's group [6,27] identified a set of 128 SNPs that they showed is useful for identification of the continental origin of people and in estimating the admixture proportions of these individuals. Thus, a particular aim was to develop a set of SNPs whose allele frequencies had major differences between the continental populations for use in matching controls and subjects in association studies. They validated all and various subsets of the 128 SNPs in their initial study of 825 individuals from 20 designated populations and subsequently in a study of 1,620 individuals from 48 population samples using a subset of 93 of the 128 SNPs [27]. Understanding that a set of AIMs (or ancestry identification SNPs, AISNPs) will only be broadly useful for population relationships and for identifying admixture if that set can be shown on a very large data set to be valid, we have tripled the size of the Nassir et al. [27] population set, increased the number of population samples to 119 and analyzed this sample with the 128 AIMs of Kosoy et al. [6]. We find that this set of 128 AISNPs is not only globally informative for origins from major geographic regions but also informative for distinguishing relationships within several of those regions. 
This provides further support for the usefulness of this set of SNPs in some ancestry/admixture analyses. We also note that these AISNPs are not particularly good at distinguishing within certain groups of populations, and a comparison of the Nassir et al. [27] results with ours illustrates effects of choice and size of the population samples analyzed.

Methods

Samples

We assembled a data set of samples of 4,871 individuals: those from the HapMap 3 [28], the Human Genome Diversity Project (HGDP) [29,30] and our lab, all typed for the 128 SNPs of Kosoy et al. [6]. Some of the HGDP samples used by Nassir et al. [27] are also included in our study, and for some of their populations we have an independent sample, e.g., Ashkenazi Jews. The HGDP contains 355 DNA samples from our lab or from cell lines we hold and routinely type in our lab and another 31 HGDP DNA samples are DNA samples we also have in our lab. When an HGDP sample is a subset of one of our population samples, we used only our inclusive sample. When a sample from our lab overlapped with an HGDP sample, the duplicates were removed from our sample and the full HGDP sample was included separately from our supplementary sample. Thus, we occasionally have two samples from the same population (e.g., Druze, PNG, Makrani), but no individuals from the two samples overlap. Sixteen populations are represented by two to four samples. Some of the "duplicate" populations (e.g., Han, Russians, Maasai) were sampled in different areas or countries, and some of the "duplicate" populations are independent samples from the same locale. Finally, the offspring in the HapMap 3 samples (ASW, CEU, MKK, YRI, and MEX) were removed so that the samples include only unrelated people. Table 1 provides the name, sources of the data and sample size for each of the final set of 119 population samples. All samples from our lab were collected with informed consent under protocols approved by the Institutional Review Board (IRB) at Yale University and other relevant IRBs; the other data are in the public domain. Descriptions of all of the population samples are in ALFRED [28] associated with the allele frequencies.
Table 1

Name, source of data, and sample size for the 119 population samples*

Population            Abbreviation  N    Source
Biaka                 BIA           67   Yale*
Mbuti                 MBU           39   Yale*
Mandenka              MND           24   HGDP*
Lisongo               LSG           8    Yale
Yoruba                YOR           77   Yale
Yoruba YRI            YRI           113  HapMap*
Ibo                   IBO           48   Yale
Zaramo                ZRM           36   Yale
Hausa                 HAS           39   Yale
Bantu_NE              BTN           12   HGDP*
Bantu_S               BTS           8    HGDP*
San                   SAN           6    HGDP*
Luhya LWK             LWK           90   HapMap
African American 1    AAM           90   Yale
African American ASW  ASW           56   HapMap
Chagga                CGA           45   Yale
Maasai, T             MAS           20   Yale
Maasai MKK            MKK           144  HapMap
Sandawe               SND           40   Yale
Ethiopian Jews        ETH           32   Yale
Somali                SML           12   Yale
Mozabite              MOZ           30   HGDP*
Kuwaiti               KWT           16   Yale
Samaritans            SAM           40   Yale
Yemenite Jews         YMJ           42   Yale
Palestinian 1         PLA-1         49   Yale
Palestinian 2         PLA-2         51   HGDP*
Druze 1               DRU-1         75   Yale
Druze 2               DRU-2         47   HGDP*
Bedouin               BDN           48   HGDP*
Roman Jews            RMJ           26   Yale
Adygei                ADY           54   Yale*
Greeks                GRK           53   Yale
Ashkenazi Jews        ASH           79   Yale
Tuscan 1              Tus           8    HGDP
Tuscan TSI            TSI           88   HapMap
Sardinian 1           SRD-1         34   Yale
Sardinian 2           SRD-2         28   HGDP
Orcadian              ORC           16   HGDP
North_Italian         ITN           13   HGDP
French_Basque         FRB           24   HGDP*
French                FRN           29   HGDP
Hungarians            HGR           89   Yale
Irish                 IRI           114  Yale
European American 1   EAM           89   Yale
European Amer CEU     CEU           115  HapMap*
Russians 1            RUA           33   Yale
Russians 2            RUV           47   Yale*
Finns                 FIN           34   Yale
Danes                 DAN           51   Yale
Komi Zyriane          KMZ           47   Yale
Chuvash               CHV           42   Yale
Makrani 1             MKR-2         26   Yale
Makrani 2             MKR-1         25   HGDP
Kalash                KLS           25   HGDP*
Brahui                BRH           25   HGDP
Balochi               BCH           25   HGDP*
Sindhi                SDI           25   HGDP
Keralite              KER           30   Yale
Thoti                 THT           14   Yale
Kachari               KCH           17   Yale
Gujarati GIH          GIH           88   HapMap
Pathan 1              PTH-1         75   Yale
Pathan 2              PTH-2         23   HGDP
Mohanna               MHN           48   HGDP
Burusho               BSH           25   HGDP*
Khanty                KTY           50   Yale
Hazara 1              HZR-1         87   Yale
Hazara 2              HZR-2         24   HGDP
Uygur 2               UYG           10   HGDP*
Uygur 1               UIG           45   Yale
Khazak                KAZ           44   Yale
Khamba Tibetan        KHG           27   Yale
Mongolians 1          MVF           62   Yale
Mongolians 2          MGL           10   HGDP*
HmongBlack            HMQ           46   Yale
BaimaDee              BQH           40   Yale
Qiang                 QMR           38   Yale
Hlai                  LIC           47   Yale
Yakut                 YAK           51   Yale*
Dai                   DAI           10   HGDP
Lahu                  LHU           10   HGDP*
Miaozu                MIZ           10   HGDP
Naxi                  NXI           9    HGDP
Oroqen                OQN           10   HGDP
She                   SHE           10   HGDP
Tu                    TU            10   HGDP
Tujia                 TUJ           10   HGDP
Xibo                  XBO           9    HGDP
Yizu                  YIZ           10   HGDP
Daur                  DUR           9    HGDP*
Hezhen                HEZ           9    HGDP
Han, SF               HAN           43   HGDP
Han CHD               CHD           85   HapMap
Han CHB               CHB           84   HapMap*
Han, Taiwan           CHT           50   Yale
Hakka                 HKA           41   Yale
Koreans               KOR           54   Yale
Japanese              JPN           50   Yale
Japanese JPT          JPT           86   HapMap*
Laotians              LAO           118  Yale
Cambodians            CBD           24   Yale*
Ami                   AMI           40   Yale
Atayal                ATL           42   Yale
Malaysians            MLY           11   Yale
Micronesians          MCR           34   Yale
Samoans               SMO           8    Yale
P-NG 1                PNG           13   Yale
P-NG 2                PNG           17   HGDP*
Nasioi                NAS           22   Yale
Mexican Amer MEX      MEX           49   HapMap*
Pima Mexico           PMM           53   Yale*
Maya                  MAY           51   Yale*
Quechua               QUE           22   Yale
Colombians            COL-2         13   HGDP*
Guihiba               COL-1         11   Yale
Ticuna                TIC           65   Yale
Surui R               SUR           45   Yale
Karitiana             KAR           55   Yale

*These or subsets of these samples were included in Nassir et al. (2008). Descriptions of the populations and samples are in ALFRED.

Marker Data

The polymorphic sites were those reported by Kosoy et al. [6]. The 3,071 samples from our lab were typed by TaqMan SNP Genotyping Assays® (Applied Biosystems, Foster City, California, USA). The HGDP marker data were downloaded from http://hagsc.org/hgdp/files.html [31]. The HapMap data were downloaded from http://hapmap.ncbi.nlm.nih.gov/index.html.yo [28]. Of the 128 SNPs typed for 119 population samples, only 16 instances (one AISNP for one population) of missing data existed in the public data. Eleven SNPs in seven HapMap 3 populations do not have genotype data available, and our estimates for those frequencies do not significantly affect the PCA results (Additional File 1).

Fst

Fst was calculated across all populations for each marker using the simple formula of Wright [32]: $F_{st} = \sigma^2 / (\bar{p}\,\bar{q})$, where $\bar{p}$ and $\bar{q} = 1 - \bar{p}$ are the mean frequencies of the two alleles across populations and $\sigma^2$ is the variance of the allele frequency among populations. For comparison, Fst was calculated for 2,327 other polymorphisms typed on our samples. None of these 2,327 polymorphisms included sites specifically selected for admixture or ancestry identification, or for individual identification; instead, they were all selected for other ongoing projects in our lab (i.e., linkage disequilibrium, disease/disorder association).
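As an illustration of the per-marker calculation, the following is a minimal sketch that assumes the standard form of Wright's formula, Fst = variance of the allele frequency among populations divided by p̄(1 − p̄). The function name `wright_fst`, the use of an unweighted mean across populations and the example frequencies are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def wright_fst(freqs):
    """Wright's Fst for one biallelic marker.

    freqs: allele frequency of the same allele in each population sample.
    Assumption: populations are weighted equally (unweighted mean and
    population variance, ddof=0); the paper does not state the weighting.
    """
    p = np.asarray(freqs, dtype=float)
    p_bar = p.mean()                      # mean allele frequency across populations
    if p_bar <= 0.0 or p_bar >= 1.0:      # monomorphic everywhere: no differentiation
        return 0.0
    var_p = p.var()                       # variance of the frequency among populations
    return var_p / (p_bar * (1.0 - p_bar))

# Illustrative (made-up) frequencies for two markers in three populations
markers = {
    "marker_a": [0.1, 0.5, 0.9],   # strongly differentiated
    "marker_b": [0.5, 0.5, 0.5],   # identical everywhere
}
fst_by_marker = {name: wright_fst(f) for name, f in markers.items()}
```

Applying such a function to each of the 128 AISNPs and to each reference polymorphism would give the two Fst distributions that are compared in the Results.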

PCA

Principal component analyses (PCA) of population sample allele frequencies were performed using XLSTAT (version 2009.4.07; Addinsoft SARL, http://www.xlstat.com/en/company/) as one method to evaluate the effectiveness of these SNPs for distinguishing among populations and to determine the major factors accounting for the population frequencies.
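For readers without XLSTAT, the same kind of analysis can be sketched with NumPy. This is a minimal sketch, assuming a populations-by-SNPs matrix of allele frequencies and a covariance-based PCA (columns centered but not scaled); the XLSTAT settings used by the authors are not stated, so the results need not match theirs exactly. The function name `pca_allele_freqs` and the random input are hypothetical.

```python
import numpy as np

def pca_allele_freqs(freq_matrix, n_components=3):
    """PCA of a (populations x SNPs) allele-frequency matrix via SVD.

    Returns the population scores on the first components and the
    fraction of total variance each of those components explains.
    Assumption: covariance-based PCA (centering only, no scaling).
    """
    X = np.asarray(freq_matrix, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each SNP column
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)         # variance fraction per component
    return scores, explained[:n_components]

# Hypothetical input: 10 population samples, 6 SNPs, random frequencies
rng = np.random.default_rng(0)
demo_freqs = rng.random((10, 6))
demo_scores, demo_explained = pca_allele_freqs(demo_freqs)
```

With the real data, the matrix would have 119 rows and 128 columns, and the first three entries of the explained-variance vector correspond to the factor percentages reported in the Results.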

Structure

Structure (version 2.3.3; software freely available at http://pritch.bsd.uchicago.edu/structure.html [33-35]) was also used to evaluate and illustrate the effectiveness of these sites in distinguishing among these populations. The burn-in was set at 20,000, followed by 10,000 iterations, and a model of correlated allele frequencies was specified. Ten replicates at each K value from 2 to 6 and 20 replicates at K = 7 and K = 8 were evaluated using CLUMPP (software freely available at http://rosenberglab.bioinformatics.med.umich.edu/clumpp.html) [36]. Specific solutions were plotted using DISTRUCT 1.1 (software freely available at http://rosenberglab.bioinformatics.med.umich.edu/distruct.html) [37]. The matrix of pairwise similarities among replicate runs was used to identify different overall patterns based on high G values among runs with the "same" pattern and lower values for runs with different patterns.

Results

New data

The allele frequencies for the 128 AISNPs for all 119 population samples are given in Additional file 2, and the allele frequencies of the 69 population samples tested in our lab have all been entered into the ALFRED database [30] and can be readily accessed using the rs# of each SNP. The Fst distribution of the 128 AISNPs was compared to the distribution of 2,327 non-AISNPs typed in our lab (Figure 1 and Additional file 3). Although Kosoy et al. [6] selected their 128 AISNPs not on the basis of Fst, but rather on the Informativeness statistic (In) of Rosenberg et al. [38,39], Fst clearly separates the two distributions by 1.25 standard deviations. The null hypothesis that the two distributions are the same is rejected with a probability considerably less than 0.001. Outliers in the two distributions are given in Additional file 4. At the high-Fst end of the distributions, there are nine sites with Fst greater than 0.48: seven are in the reference distribution, and two are in the AISNP distribution. Of the seven in the reference distribution, five are located in or near genes of known phenotypic effect (SLC24A5, OCA2 (two SNPs), HERC2 and EDAR), and each of these genes is well known to have SNPs with marked global variation in allele frequency; but the best "known" SNPs are not part of this 128 AISNP set (Additional file 4). Though not associated with a phenotype, the remaining two "outliers" in the reference distribution have comparably high Fst values (Additional file 4). The two outliers at the high end of the AISNP distribution are sites in or near EDAR (rs260690, Fst = 0.5205) and RTTN (rs4891825, Fst = 0.5176). There are 10 outliers at the low end of the reference Fst distribution with Fst <0.04. Only one of the AISNPs falls below the mode of the reference distribution: TWGS1 (rs4798812, Fst = 0.08753).
Figure 1

Comparisons of Fst distributions for the 128 ancestry informative single-nucleotide polymorphisms (AISNPs) and for a reference set of 2327 SNPs.

Figure 2 presents the first three factors of the PCA analysis based on allele frequencies of each of the 119 samples. The first two factors account for more than 72% of the variance. Factor 3 accounts for an additional 8.7% of the total variance. Factor 1 clearly separates the Native Americans from all other groups, and factor 2 clearly separates the African populations from the rest. Factor 3 emphasizes the difference between Native Americans and East Asians. This set of AISNPs was selected by Kosoy et al. [6] to maximize the differences among European Americans, Africans and Native Americans; those three groups clearly are at the vertices of the triangular pattern based on factors 1 and 2 (Figure 2). We also note that Eurasian populations show less clear separation.
Figure 2

Principal component analysis (PCA) of 119 population samples based on allele frequencies of 128 AISNPs.

Results for the specific structure runs with the highest likelihood at each K value, K = 2-8, are shown in Figure 3 along with the number of times the particular overall pattern occurred. At all the K values, there are populations in each of the groups that seem quite homogeneous. By K = 8, the likelihoods began to plateau (Additional file 5), providing a statistically reasonable stopping point. For a better understanding of the ability of the data to distinguish most likely ancestry at the higher K values, we ran structure a total of 20 times, and the patterns seen more than once are illustrated in Figure 4; the patterns and likelihoods of the individual runs are given in Table 2. As is obvious from the patterns and the likelihoods of the individual runs, some distinctions are quite consistent while others generate similar likelihoods with combinatorial alternatives for a few different groups of populations.
Figure 3

.

Figure 4

The different patterns seen more than once in solutions from 20 runs of structure.

Table 2

Patterns and likelihoods of 20 structure runs at K = 7 and K = 8

K = 7                                        K = 8
Pattern  LnP(D)   Run    Best per pattern    Pattern  LnP(D)   Run    Best per pattern
A        -591354  run13  *                   A        -590090  run13  *
B        -591528  run1   *                   B        -590185  run1   *
B        -591555  run2                       B        -590570  run6
A        -591571  run3                       B        -590605  run16
B        -591707  run12                      A        -590606  run15
B        -591724  run8                       C        -590867  run4   *
C        -591822  run7   *                   A        -591033  run18
A        -591855  run17                      A        -591053  run2
B        -591944  run15                      C        -591080  run20
C        -591949  run5                       A        -591090  run10
C        -591957  run9                       E        -591160  run5
C        -592012  run11                      C        -591298  run14
B        -592017  run4                       C        -591371  run3
D        -592137  run20  *                   D        -591512  run7   *
D        -592272  run18                      D        -591689  run17
E        -592309  run6                       A        -591744  run12
C        -592342  run16                      C        -591745  run8
C        -592548  run19                      F        -592008  run19
C        -592605  run14                      G        -592162  run11
D        -593102  run10                      H        -592261  run9
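The pattern counts and best likelihoods discussed for Table 2 can be checked with a short tally. The sketch below transcribes the K = 7 columns of Table 2 by hand; the helper name `summarize_runs` is hypothetical, and the snippet only illustrates the bookkeeping, not anything done by structure or CLUMPP.

```python
from collections import Counter

# (pattern, LnP(D)) for the 20 structure runs at K = 7, transcribed from Table 2
K7_RUNS = [
    ("A", -591354), ("B", -591528), ("B", -591555), ("A", -591571),
    ("B", -591707), ("B", -591724), ("C", -591822), ("A", -591855),
    ("B", -591944), ("C", -591949), ("C", -591957), ("C", -592012),
    ("B", -592017), ("D", -592137), ("D", -592272), ("E", -592309),
    ("C", -592342), ("C", -592548), ("C", -592605), ("D", -593102),
]

def summarize_runs(runs):
    """Count how often each pattern occurs and keep its best LnP(D)."""
    counts = Counter(pattern for pattern, _ in runs)
    best = {}
    for pattern, lnp in runs:
        if pattern not in best or lnp > best[pattern]:
            best[pattern] = lnp
    return counts, best

k7_counts, k7_best = summarize_runs(K7_RUNS)
most_common_pattern = k7_counts.most_common(1)[0][0]   # most frequent pattern
top_likelihood_pattern = max(k7_best, key=k7_best.get)  # pattern of the single best run
```

At K = 7 the most frequent pattern and the pattern of the highest-likelihood run differ, which is the situation the text describes.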
At K = 7, there is no single solution clearly identifiable as best. Five different overall patterns occur in the 20 runs. The pattern illustrated in Figure 3 has the highest likelihood but is not the most common pattern. The next highest likelihood is nearly identical and occurs for a pattern that occurs in 6 of the 20 runs and differs by subdividing East Asian populations and not distinguishing the Pacific populations. The most frequent pattern, found for 7 of the 20 runs, does not have the highest likelihoods and differs in separating East African populations from the Pacific populations. At K = 8, eight different overall patterns occur among the 20 separate runs. The pattern shown is the most commonly found and does have the highest likelihood among the 20 runs. A nearly equal likelihood occurred for a somewhat different overall pattern that subdivides East Asia rather than separating East African populations.
At K = 8, the results can be summarized with respect to the pattern shown in Figure 3 (pattern A in Figure 4). Starting from the left, the sub-Saharan Africans, especially the West Africans, seem relatively homogeneous (red). The East Africans, especially the MKK, Sandawe and Ethiopian Jews, can form a distinct grouping (pink), in which case other Tanzanian populations, the Maasai and Chagga, and the Somali (now living in Pakistan) appear intermediate between East and West Africa. The next consistent cluster includes the Mozabites and Southwest Asians (green). There is then a more-or-less gradient across Europe from southeastern and southern Europe (mostly green), through to northwestern Europe, and ending in northeastern Europe (mostly yellow).
The south-central Asian populations form another consistent and relatively homogeneous cluster of populations, including East Indians and several Pakistani populations (dark blue). The Khazaks, Uyghur, Hazara and Khanty form a "group" that is depicted as admixed under any of the alternative common patterns. The next group of populations (dark gray) appears homogeneous from the Khamba-Tibet through Southeast Asia all the way to East Asia, but the alternative (pattern B in Figure 4) has the western Chinese groups at one end of a more clinal pattern with the southeastern Asians at the other end. Interestingly, this alternative depicts the Han, Koreans and Japanese as admixed. The next clear cluster (light blue) is Pacific and consists of three Melanesian samples; the Samoans, Micronesians and Malaysians appear intermediate between East Asia and the Pacific. The final clear cluster (pink) consists of Native American samples. The different patterns at K = 7 and K = 8 show fine distinctions even among the regions that are superficially similar. To make some of these clearer, we have generated the population averages for the best result (highest likelihood) for each of the patterns (Figure 5). These emphasize the variation among individuals in each population sample by showing the population as multiple colors. These figures also emphasize the southwestern Asia through northern Europe cline seen in all patterns.
Figure 5

Average population assignment to clusters for . The data are the same as the K = 8 analysis in Figures 3 and 4.

Clusters that emerge at even higher values of K include a Pygmy/San/S Bantu cluster in Africa, a Khanty/Khazak/Yakut cluster in Asia and a vaguely central Asian group consisting of, for example, the Khamba-Tibet, Mongolian, Baima Dee and Qiang. These clusters, though reasonable, are not strongly supported statistically.

Discussion

A set of markers particularly useful for determining in detail the genetic distinctions among populations should also be useful in an examination of admixture. However, "admixture" is not a singular phenomenon: A sample of individuals might be considered admixed if it is composed of (1) samples from two or more different populations, (2) the descendants of people from two or more populations who have "recently" intermarried, (3) descendants of people from two or more populations who have intermarried in the ancient past and (4) people discretely sampled from a single region along a geographic allele frequency cline established predominantly by random genetic drift. The American Society of Human Genetics Ancestry and Ancestry Testing Task Force, in its white paper [40], sets forth caveats to be kept in mind in ancestry inference, perhaps the foremost of which is that ancestral populations cannot be observed directly and that even surrogates for those ancestral populations may not be included in any given study. Therefore, the "gold standard" analytic programs such as structure (version 2.3.3; http://pritch.bsd.uchicago.edu/structure.html[32-34]) will cause individuals in some populations to appear as an "admixture" of the population samples that are in the analysis. Even in analyses of principal components, it is not possible to distinguish whether a population is admixed or simply intermediate. Thus, a set of AIMs estimating the ancestry of an individual whose ancestry involves populations other than the majority of the populations in an analysis may be unsatisfactory by forcing that individual to be explained by the ancestry inferred for the majority. Further, a set of AIMs selected for one set of populations cannot be expected to be as good at distinguishing among other populations, perhaps even from the same geographic regions. 
It is important to realize that the outcome of any analysis of admixture or other population structure depends heavily on both the population samples and the markers used. Though we have included some of the same HGDP populations as Kosoy et al. [6] did in their analyses, the outcome is always a function of which samples are included. Thus, in our selection of samples, we have also included samples that overlap with those reported by Nassir et al. [27] as well as others not part of either the Kosoy et al. [6] or Nassir et al. [27] reports. The results shown in Figure 2 clearly reflect the criteria used to select this set of AISNPs [6]. The strongest discrimination reflects the geographic and ancestral origins of those populations (Africa, Europe, East Asia and the Americas), even though this analysis included none of the original samples used to select the SNPs. The first two components provide strong support for these SNPs in studies involving African, European and Native American populations. The relatively poorer separation among Eurasian and Pacific populations reflects the absence of Central, South and East Asian and Pacific populations in the selection of these AIMs as well as their distinct evolutionary relationships relative to African and Native American populations. It is logical to expect that if more SNPs with large allele frequency differences across Eurasia were included, factor 3 would show greater separation between west and east Eurasia. Structure attempts to find the set of K population allele frequencies that will give the best fit to all individual samples assuming Hardy-Weinberg ratios for each of the K populations. Structure does not consider or produce analyses of population relationships. Fortunately, this is not an issue of interest to forensic science. Rather, structure assigns individuals to clusters of genetically similar individuals. 
Obviously, if numbers of individuals differ greatly among different populations, a population sample with a large number of individuals will influence the allele frequencies of the particular cluster into which it falls more than a population with a small number of individuals. Thus, a small population from the middle of a cline with larger numbers in populations from the more extreme parts of the cline will appear "admixed." Such is seen at K = 3 for the South, Central and East Asian populations. However, a large population from that middle region will, at the same K value, cause the allele frequency estimates of the flanking clusters to move toward the center even if cluster assignments do not change. The consequences at higher K values may be that the "middle" population is a distinct group or, by shifting the estimates for the flanking clusters, cause a population at the extreme of the cline to "fall off" and become a separate cluster. The cluster assignments at K = 4 and K = 5 illustrate this (Figure 3). In other words, conclusions about groupings at a given value of K are a function of the populations sampled and their relative sample sizes. Thus, it is not necessarily correct that the estimated allele frequencies for a given cluster represent the ancestral population, nor can one automatically interpret a partial assignment to two or more clusters as admixture. In addition, as shown in Figure 4 and Table 2, there is a stochastic element in each structure run such that the relative likelihoods of different patterns from different runs depend on the particular outcomes that happen to occur. Thus, the point of using structure is not the single best run or the most common pattern seen, but the stability of aspects of the patterns and of the individual runs within each pattern among the runs with the higher likelihoods. Kalinowski [41] has recently published studies making additional relevant points on the interpretation of structure results. 
An example of the sample size effect appears to be found in Nassir et al. [27]. That study contains 49 populations with a total sample size of 1,620. Their Ashkenazi sample of 240 individuals (two population samples pooled: Ashkenazi AM 4 GP and Ashkenazi AM) constitutes about 15% of the total. Similarly, the 259 European Americans (two population samples pooled: European AM CEU and European AM NYCP) constitute about 16% of the total. These two heavily weighted population samples probably decrease the resolution of European and southwestern Asian populations. Our data set with the same sites and no population consisting of more than 6% (Han, pooling four population samples, CHB, CHD, SF and Taiwan = 5.4%) of the total sample can begin to distinguish a southwestern Asian cluster at K = 6, though showing a cline through Europe. Unfortunately, almost all of our East Asian samples, including many Chinese minorities, are de facto similar, with this set of AISNPs constituting the equivalent of nearly a quarter of our whole sample through K = 8, clearly affecting how South Asian and especially Central Asian populations appear. There are, however, differences among them sufficient to result in a more complex clinal pattern as a reasonable alternative at K = 7 and K = 8 (Figure 4). In the ideal world, a world we doubt exists, all samples would be large, equal in size and evenly distributed around the world.
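The sample-share figures quoted above follow directly from the stated sample sizes. As a small worked check, using only numbers given in the text and in Table 1 (the helper name `percent_share` is hypothetical):

```python
def percent_share(group_sizes, total):
    """Percentage of the total sample contributed by a pooled group."""
    return 100.0 * sum(group_sizes) / total

# Nassir et al. [27]: 1,620 individuals in total
ashkenazi_share = percent_share([240], 1620)           # pooled Ashkenazi samples
european_american_share = percent_share([259], 1620)   # pooled European American samples

# This study: 4,871 individuals; Han pooled from Han SF, CHD, CHB and Taiwan (Table 1)
han_share = percent_share([43, 85, 84, 50], 4871)
```

Rounded to one decimal place, these evaluate to 14.8%, 16.0% and 5.4%, in line with the "about 15%", "about 16%" and 5.4% given in the text.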

Forensic Implications

Our analyses have been directed toward evaluating this set of SNPs for a particular purpose: ancestry inference as an investigatory tool. We have used PCA and structure for these evaluations. However, we do not advocate using either PCA or structure as a forensic tool for inference of individual ancestry in casework. Direct evaluation by likelihood methods is much more accurate. Any polymorphism can also be used to assist in matching crime scene and suspect DNA genotypes and to estimate the probability of the match occurring by chance if allele frequency data exist. Therefore, these 128 AISNPs could be used for exclusion, but we would not advise use of these markers to estimate the probability of a match occurring by chance. They have been selected to distinguish among populations and to have highly varying frequencies. To use these data in a court, one would have to present a diverse set of calculations and assumptions. The complexities of the calculations and the assumptions would allow an easy challenge, and all potential benefits of SNPs over the standard CODIS markers would be lost. There are good panels of SNPs selected for individual identification [e.g., 25,26]. The set of SNPs for individual identification that we developed [26] largely circumvents the problem of different allele frequencies in populations from different parts of the world. Similarly, we feel the 128 AISNPs analyzed in this paper are not efficient for any estimates of phenotype beyond the very indirect inference from ancestry. The data for these SNPs can be used to "assign" regional ancestry to a single individual based on the genotypes at all or a significant fraction of these 128 SNPs. This would be done by calculating the likelihood of the multisite genotype based on the allele frequencies of each of the 119 population samples (frequencies are in ALFRED [37]). It is clear that for many genotypes, many populations will have roughly comparable likelihoods.
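The likelihood calculation described above can be sketched as follows. This is a minimal illustration that assumes Hardy-Weinberg proportions within each population and independence among SNPs; neither the function names nor the clamping constant `eps` come from the paper, and the two populations with their frequencies are invented for the example.

```python
import math

def genotype_log_likelihood(genotypes, allele_freqs, eps=1e-6):
    """Log-likelihood of a multisite genotype in one population.

    genotypes: per-SNP count (0, 1 or 2) of a designated allele, or None if untyped.
    allele_freqs: frequency of that allele at each SNP in the population.
    Assumptions: Hardy-Weinberg proportions, independent SNPs, and
    frequencies clamped to [eps, 1 - eps] so a zero frequency does not
    give an infinite penalty.
    """
    total = 0.0
    for g, p in zip(genotypes, allele_freqs):
        if g is None:                      # skip untyped SNPs
            continue
        p = min(max(p, eps), 1.0 - eps)
        if g == 2:
            prob = p * p
        elif g == 1:
            prob = 2.0 * p * (1.0 - p)
        elif g == 0:
            prob = (1.0 - p) ** 2
        else:
            raise ValueError(f"genotype must be 0, 1, 2 or None, got {g!r}")
        total += math.log(prob)
    return total

def rank_populations(genotypes, freqs_by_population):
    """Sort populations from most to least likely source of the genotype."""
    scored = {
        pop: genotype_log_likelihood(genotypes, freqs)
        for pop, freqs in freqs_by_population.items()
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)

# Invented two-SNP example with two hypothetical populations
example_ranking = rank_populations(
    [2, 0],
    {"pop_x": [0.9, 0.1], "pop_y": [0.1, 0.9]},
)
```

When the log-likelihoods of several populations are close, as the text notes will often be the case, a ranking of this kind should be read as identifying a set of comparably likely populations rather than a single assignment.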
The clusters at K = 9-11 (not shown) indicate no new strongly supported subgroups of populations and suggest, for example, that differentiating ancestry from among populations within East Asia will not be easy using the allele frequencies for this set of SNPs. It is important to distinguish population averages from the variation among individuals (Additional files 6 and 7) within that population. Figure 5 presents the population averages for the K = 8 structure analysis. Compared to the variation among individuals shown in Figure 4, the averages make some of the global patterns clearer but completely obscure the individual variation that can be of great importance in a forensic setting. In a comparative examination of a total of seven small publicized AISNP panels containing a total of 688 SNPs, we found that only one SNP (rs2065160) occurred in three of the panels and 26 other SNPs (about 4%) occurred in two panels. None of the 128 SNPs in the panel we have analyzed occurred in any of these other panels. The small number of overlapping SNPs across panels likely results from the different methods of selecting SNPs, the different data sets from which SNPs are selected and the different purposes of the panels: some are global, some are regional and some are for the four continental extremes. However, though the specifics of these panels are not relevant, it is clear that there is no single set of AISNPs that will be of value for all questions. With our additional data and the analyses presented here, this panel of 128 AISNPs is the best documented and validated for broad global application to infer ancestry. However, it is not necessarily the optimal panel depending on the question being asked, and it is definitely not optimal at identifying ancestry within Europe and Southwest Asia (cf. Figures 3, 4 and 5; K = 6-8). Distinguishing among East Asian populations is also not optimal with this set of AISNPs. 
Neither of these conclusions is surprising, since populations in those regions were not part of the selection of this set of AISNPs. Selection to identify SNPs with markedly different allele frequencies across East Asia will be necessary [24]. Many useful SNPs must already exist; the problem is to identify them. In general, as more and more SNPs are identified through ongoing sequencing projects, other SNPs may be optimal for resolving population similarities within one of the major clusters in the structure analyses of Figures 3 and 4. However, comparison of the relative discriminating ability of additional candidate SNPs requires that all SNPs be typed on the same populations and, ideally, the same individuals. That will require coordination among laboratories and sharing of data and/or samples. We have put all of the allele frequencies of the populations we have studied in this paper into ALFRED [30]; the raw individual genotype data are available on request.
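The comparison of panels discussed above, in which one SNP occurred in three panels and 26 others in two, amounts to a simple tally of how many panels contain each SNP. The sketch below shows one way such a tally could be made. It is not the procedure used in the paper: the panel names and the `rs_toy_*` identifiers are hypothetical, and only rs2065160 is taken from the text.

```python
from collections import Counter

def panel_membership_counts(panels):
    """Count, for each SNP, the number of panels that include it.

    panels: dict mapping panel name -> iterable of SNP identifiers.
    """
    counts = Counter()
    for snps in panels.values():
        counts.update(set(snps))  # set() so a SNP counts once per panel
    return counts

def overlap_summary(counts):
    """Map 'number of panels' -> 'number of SNPs found in that many panels'."""
    return Counter(counts.values())

# Toy panels (hypothetical contents).
panels = {
    "panel_a": ["rs2065160", "rs_toy_1", "rs_toy_2"],
    "panel_b": ["rs2065160", "rs_toy_2", "rs_toy_3"],
    "panel_c": ["rs2065160", "rs_toy_4"],
}
counts = panel_membership_counts(panels)
summary = overlap_summary(counts)
```

Because the tally depends only on SNP identifiers, it says nothing about relative discriminating ability. As noted above, comparing that requires typing all candidate SNPs on the same populations.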

Competing interests

The authors declare that they have no competing interests. FMDLV is employed by Life Technologies.

Authors' contributions

All authors have read and approved the final manuscript. KKK and FMDLV were involved in the conception and design of the study, and KKK assisted in writing the manuscript. JRK supervised the genotyping assays and data analysis and wrote the manuscript. FRF and AJP assisted in the data analysis. WCS assembled and integrated the data set from the literature and laboratory.

Additional file 1

List of missing values and how they were handled in the PCA.

Additional file 2

Population allele frequencies.

Additional file 3

List of Fst values for all 128 AISNPs.

Additional file 4

List of upper and lower outliers for Fst in Reference and AISNP distributions.

Additional file 5

Likelihood plot K = 2-12.

Additional file 6

Individual Assignments in structure.

Additional file 7

Population Assignments in structure.
  36 in total

1.  Inference of population structure using multilocus genotype data.

Authors:  J K Pritchard; M Stephens; P Donnelly
Journal:  Genetics       Date:  2000-06       Impact factor: 4.562

2.  A human genome diversity cell line panel.

Authors:  Howard M Cann; Claudia de Toma; Lucien Cazes; Marie-Fernande Legrand; Valerie Morel; Laurence Piouffre; Julia Bodmer; Walter F Bodmer; Batsheva Bonne-Tamir; Anne Cambon-Thomsen; Zhu Chen; J Chu; Carlo Carcassi; Licinio Contu; Ruofu Du; Laurent Excoffier; G B Ferrara; Jonathan S Friedlaender; Helena Groot; David Gurwitz; Trefor Jenkins; Rene J Herrera; Xiaoyi Huang; Judith Kidd; Kenneth K Kidd; Andre Langaney; Alice A Lin; S Qasim Mehdi; Peter Parham; Alberto Piazza; Maria Pia Pistillo; Yaping Qian; Qunfang Shu; Jiujin Xu; S Zhu; James L Weber; Henry T Greely; Marcus W Feldman; Gilles Thomas; Jean Dausset; L Luca Cavalli-Sforza
Journal:  Science       Date:  2002-04-12       Impact factor: 47.728

3.  Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies.

Authors:  Daniel Falush; Matthew Stephens; Jonathan K Pritchard
Journal:  Genetics       Date:  2003-08       Impact factor: 4.562

4.  The computer program STRUCTURE does not reliably identify the main genetic clusters within species: simulations and implications for human population structure.

Authors:  S T Kalinowski
Journal:  Heredity (Edinb)       Date:  2010-08-04       Impact factor: 3.821

5.  Algorithms for selecting informative marker panels for population assignment.

Authors:  Noah A Rosenberg
Journal:  J Comput Biol       Date:  2005-11       Impact factor: 1.479

6.  Candidate SNPs for a universal individual identification panel.

Authors:  Andrew J Pakstis; William C Speed; Judith R Kidd; Kenneth K Kidd
Journal:  Hum Genet       Date:  2007-02-27       Impact factor: 4.132

7.  A genomewide admixture mapping panel for Hispanic/Latino populations.

Authors:  Xianyun Mao; Abigail W Bigham; Rui Mei; Gerardo Gutierrez; Ken M Weiss; Tom D Brutsaert; Fabiola Leon-Velarde; Lorna G Moore; Enrique Vargas; Paul M McKeigue; Mark D Shriver; Esteban J Parra
Journal:  Am J Hum Genet       Date:  2007-04-20       Impact factor: 11.025

8.  CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure.

Authors:  Mattias Jakobsson; Noah A Rosenberg
Journal:  Bioinformatics       Date:  2007-05-07       Impact factor: 6.937

9.  Ancestry informative marker sets for determining continental origin and admixture proportions in common populations in America.

Authors:  Roman Kosoy; Rami Nassir; Chao Tian; Phoebe A White; Lesley M Butler; Gabriel Silva; Rick Kittles; Marta E Alarcon-Riquelme; Peter K Gregersen; John W Belmont; Francisco M De La Vega; Michael F Seldin
Journal:  Hum Mutat       Date:  2009-01       Impact factor: 4.878

10.  Large-scale SNP analysis reveals clustered and continuous patterns of human genetic variation.

Authors:  Mark D Shriver; Rui Mei; Esteban J Parra; Vibhor Sonpar; Indrani Halder; Sarah A Tishkoff; Theodore G Schurr; Sergev I Zhadanov; Ludmila P Osipova; Tom D Brutsaert; Jonathan Friedlaender; Lynn B Jorde; W Scott Watkins; Michael J Bamshad; Gerardo Gutierrez; Halina Loi; Hajime Matsuzaki; Rick A Kittles; George Argyropoulos; Jose R Fernandez; Joshua M Akey; Keith W Jones
Journal:  Hum Genomics       Date:  2005-06       Impact factor: 4.639

