Luca Pagani1,2,3, Daniel John Lawson4, Evelyn Jagoda2,5, Alexander Mörseburg2, Anders Eriksson6,7, Richard Villems1,8,9, Eske Willerslev10, Toomas Kivisild2,1, Mait Metspalu1, Mario Mitt11,12, Florian Clemente2,13, Georgi Hudjashov1,14,15, Michael DeGiorgio16, Lauri Saag1, Jeffrey D Wall17, Alexia Cardona2,18, Reedik Mägi11, Melissa A Wilson Sayres19,20, Sarah Kaewert2, Charlotte Inchley2, Christiana L Scheib2, Mari Järve1, Monika Karmin1,8, Guy S Jacobs21,22, Tiago Antao23, Florin Mircea Iliescu2, Alena Kushniarevich1,24, Qasim Ayub25, Chris Tyler-Smith25, Yali Xue25, Bayazit Yunusbayev1,26, Kristiina Tambets1, Chandana Basu Mallick1, Lehti Saag8, Elvira Pocheshkhova27, George Andriadze28, Craig Muller10, Michael C Westaway29, David M Lambert29, Grigor Zoraqi30, Shahlo Turdikulova31, Dilbar Dalimova32, Zhaxylyk Sabitov33, Gazi Nurun Nahar Sultana34, Joseph Lachance35,36, Sarah Tishkoff37, Kuvat Momynaliev38, Jainagul Isakova39, Larisa D Damba40, Marina Gubina40, Pagbajabyn Nymadawa41, Irina Evseeva42,43, Lubov Atramentova44, Olga Utevska44, François-Xavier Ricaut45, Nicolas Brucato45, Herawati Sudoyo46, Thierry Letellier45, Murray P Cox15, Nikolay A Barashkov47,48, Vedrana Skaro49,50, Lejla Mulahasanovic51, Dragan Primorac52,53,54,50, Hovhannes Sahakyan1,55, Maru Mormina56, Christina A Eichstaedt2,57, Daria V Lichman40,58, Syafiq Abdullah59, Gyaneshwer Chaubey1, Joseph T S Wee60, Evelin Mihailov11, Alexandra Karunas26,61, Sergei Litvinov26,61,1, Rita Khusainova26,61, Natalya Ekomasova61, Vita Akhmetova26, Irina Khidiyatova26,61, Damir Marjanović62,63, Levon Yepiskoposyan55, Doron M Behar1, Elena Balanovska64, Andres Metspalu7,11, Miroslava Derenko65, Boris Malyarchuk65, Mikhail Voevoda66,40,58, Sardana A Fedorova48,47, Ludmila P Osipova40,58, Marta Mirazón Lahr67, Pascale Gerbault68, Matthew Leavesley69,70, Andrea Bamberg Migliano71, Michael Petraglia72, Oleg Balanovsky73,64, Elza K Khusnutdinova26,61, Ene Metspalu1,8, Mark G Thomas68, Andrea Manica7, Rasmus 
Nielsen74.
Abstract
High-coverage whole-genome sequence studies have so far focused on a limited number of geographically restricted populations, or been targeted at specific diseases, such as cancer. Nevertheless, the availability of high-resolution genomic data has led to the development of new methodologies for inferring population history and refuelled the debate on the mutation rate in humans. Here we present the Estonian Biocentre Human Genome Diversity Panel (EGDP), a dataset of 483 high-coverage human genomes from 148 populations worldwide, including 379 new genomes from 125 populations, which we group into diversity and selection sets. We analyse this dataset to refine estimates of continent-wide patterns of heterozygosity, long- and short-distance gene flow, archaic admixture, and changes in effective population size through time as well as for signals of positive or balancing selection. We find a genetic signature in present-day Papuans that suggests that at least 2% of their genome originates from an early and largely extinct expansion of anatomically modern humans (AMHs) out of Africa. Together with evidence from the western Asian fossil record, and admixture between AMHs and Neanderthals predating the main Eurasian expansion, our results contribute to the mounting evidence for the presence of AMHs out of Africa earlier than 75,000 years ago.
Year: 2016 PMID: 27654910 PMCID: PMC5164938 DOI: 10.1038/nature19792
Source DB: PubMed Journal: Nature ISSN: 0028-0836 Impact factor: 49.962
ED1 Sample Diversity and Archaic signals.
A: Map of location of samples highlighting the Diversity/Selection Sets; B: ADMIXTURE plot (K=8 and 14) which relates general visual inspection of genetic structure to studied populations and their region of origin; C: Sample level heterozygosity is plotted against distance from Addis Ababa. The trend line represents only non-African samples. The inset shows the waypoints used to arrive at the distance in kilometres for each sample. D: Boxplots were used to visualize the Denisova (red), Altai (green) and Croatian Neanderthal (blue) D distribution for each regional group of samples. Oceanian Altai D values show a remarkable similarity with the Denisova D values for the same region, in contrast with the other groups of samples where the Altai boxplots tend to be more similar to the Croatian Neanderthal ones.
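The waypoint-based distance in panel C can be illustrated with a short sketch. It assumes great-circle (haversine) legs summed along a path from Addis Ababa through a list of waypoints to the sample. The function names, the Earth radius and the coordinates are illustrative assumptions; the actual waypoints are those shown in the inset and are not reproduced here.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)
ADDIS_ABABA = (9.03, 38.74)  # approximate (lat, lon) in degrees


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def path_distance_km(origin, waypoints, sample):
    """Sum the great-circle legs origin -> waypoint(s) -> sample."""
    stops = [origin, *waypoints, sample]
    return sum(haversine_km(p, q) for p, q in zip(stops, stops[1:]))
```

A call such as `path_distance_km(ADDIS_ABABA, waypoints, sample_coords)` would give the x-axis value for one sample, where `waypoints` is a hypothetical list of (lat, lon) tuples chosen to keep the path over land.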
ED2 Data quality checks and heterozygosity patterns.
Concordance of DNA sequencing (Complete Genomics Inc.) and DNA genotyping (Illumina genotyping arrays) data (ref-ref; het-ref-alt and hom-alt-alt, see SI 1.6) from chip (A) and sequence data (B). Coverage (depth) distribution of variable positions, divided by DNA source (Blood or Saliva) and Complete Genomics calling pipeline (release version) (C). Genome-wide distribution of Transition/Transversion ratio subdivided by DNA source (Saliva or Blood) and by Complete Genomics calling pipeline (D). Genome-wide distribution of Transition/Transversion ratio subdivided by chromosomes (E). Inter-chromosome differences in observed heterozygosity in 447 samples from the Diversity Set (F). Inter-chromosome differences in observed heterozygosity in a set of 50 unpublished genomes from the Estonian Genome Center, sequenced on an Illumina platform at an average coverage exceeding 30x (G). Inter-chromosome differences in observed heterozygosity in the phase 3 of the 1000 Genomes Project (H). The total number of observed heterozygous sites was divided by the number of accessible basepairs reported by the 1000 Genomes Project.
ED3 FineSTRUCTURE shared ancestry analysis.
ChromoPainter and FineSTRUCTURE results, showing both inferred populations with the underlying (averaged) number of haplotypes that an individual in a population receives (rows) from donor individuals in other populations (columns). 108 populations are inferred by FineSTRUCTURE. The dendrogram shows the inferred relationship between populations. The numbers on the dendrogram give the proportion of MCMC iterations for which each population split is observed (where this is less than 1). Each “geographical region” has a unique colour from which individuals are labeled. The number of individuals in each population is given in the label; e.g. “4Italians; 3Albanians” is a population of size 7 containing 4 individuals from Italy and 3 from Albania.
ED4 MSMC genetic split times and outgroup f3 results.
The MSMC split times estimated between each sample and a reference panel of 9 genomes were linearly interpolated to infer the broader square matrix (A). Summary of outgroup f3 statistics for each pair of non-African populations (B) or to an ancient sample (C) using Yoruba as an outgroup. Populations are grouped by geographic region and are ordered with increasing distance from Africa (left to right for columns and bottom to top for rows). Colour bars at the left and top of the heat map indicate the colour coding used for the geographical region. Individual population labels are indicated at the right and bottom of the heat map. The f3 statistics are scaled to lie between 0 and 1, with a black colour indicating those close to 0 and a red colour indicating those close to 1. Let $m$ and $M$ be the minimum and maximum f3 values within a given row (i.e., focal population). That is, for focal population $X$ (on rows), $m = \min_{Y \neq X} f_3(X, Y; \mathrm{Yoruba})$ and $M = \max_{Y \neq X} f_3(X, Y; \mathrm{Yoruba})$. The scaled f3 statistic for a given cell in that row is given by $f_3^{\mathrm{scaled}} = (f_3 - m)/(M - m)$, so that the smallest f3 in the row has value $f_3^{\mathrm{scaled}} = 0$ (black) and the largest has value $f_3^{\mathrm{scaled}} = 1$ (red). By default, the diagonal has value $f_3^{\mathrm{scaled}} = 1$ (red). The heat map is therefore asymmetric, with the population closest to the focal population at a given row having value $f_3^{\mathrm{scaled}} = 1$ (red colour) and the population farthest from the focal population at a given row having value $f_3^{\mathrm{scaled}} = 0$ (black colour). Therefore, at a given row, scanning the columns of the heat map reveals the populations with the most shared ancestry with the focal population of that row in the heat map.
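The row-wise scaling described in this legend can be written as a minimal sketch. It assumes a square matrix `f3` in which `f3[x][y]` holds f3(X, Y; Yoruba) and the diagonal entries are ignored. The function name and the handling of rows where all off-diagonal values are equal are assumptions made for illustration.

```python
def scale_f3_rows(f3):
    """Row-wise min-max scaling of an outgroup-f3 matrix (diagonal set to 1)."""
    n = len(f3)
    scaled = []
    for x in range(n):
        # m and M are taken over Y != X, as in the legend
        off_diag = [f3[x][y] for y in range(n) if y != x]
        m, M = min(off_diag), max(off_diag)
        row = []
        for y in range(n):
            if y == x:
                row.append(1.0)  # diagonal is 1 by convention
            elif M == m:
                row.append(0.0)  # assumption: a flat row maps to 0
            else:
                row.append((f3[x][y] - m) / (M - m))
        scaled.append(row)
    return scaled
```

Because m and M are computed separately for each row, `scaled[x][y]` and `scaled[y][x]` generally differ even when the input matrix is symmetric, which is the asymmetry noted in the legend.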
Figure 1 Genetic barriers across space.
Spatial visualisation of genetic barriers inferred from genome-wide genetic distances, quantified as the magnitude of the gradient of spatially interpolated allele frequencies (value denoted by colour bar; grey areas were land during the last glacial maximum but are currently under water). Here we used a spatial kernel smoothing method based on the matrix of pairwise average heterozygosity and a MATLAB script that plots the hexagons of the grid with a colour coding to represent gradients. Inset: partial correlation between magnitude of genetic gradients and combinations of different geographic factors, elevation (E), temperature (T) and precipitation (P), for genetic gradients from fineSTRUCTURE (red) and allele frequencies (blue). This analysis (see SI1:2.2.2 for details) shows that genetic differences within this region display some correlation with physical barriers such as mountain ranges, deserts, forests, and open water (such as the Wallace line).
ED5 Geographical patterns of genetic diversity.
Isolation by distance pattern across areas of high genetic gradient, using Europe as a baseline. The samples used in each analysis are indicated by coloured lines on the maps to the right of each plot. The panels show FST as a function of distance across the Himalayas (A), the Ural mountains (B), and the Caucasus (C) as reported on the color-coded map (D). Effect of creating gaps in the samples in Europe (E): we tested the effect of removing samples from stripes, either north to south (F) or west to east (G), to create gaps comparable in size to the gaps in samples in the dataset. Effective migration surfaces inferred by EEMS (H).
ED6 Summary of positive selection results.
Barplot comparing frequency distributions of functional variants in Africans and non-Africans (A). The distribution of exonic SNPs according to their functional impact (synonymous, missense and nonsense) as a function of allele frequency. Note that the data from both groups were normalised for a sample size of n=21 and that the Africans show significantly (chi-squared p-value $< 10^{-15}$) more rare variants across all site classes.
Result (B) of 1000 bootstrap replicates of the Rxy test for a subset of pigmentation genes highlighted by GWAS (n=32). The horizontal line provides the African reference (x=1) against which all other groups are compared. The blue and red marks show the 95th and the 5th percentile of the bootstrap distributions respectively. If the 95th percentile is below 1, then the population shows a significant excess of missense variants in the pigmentation subset relative to the Africans. Note that this is the case for all non-Africans except the Oceanians. Pools (C) of individuals for selection scans. fineSTRUCTURE based coancestry matrix was used to define twelve groups of populations for the downstream selection scans. These groups are highlighted in the plot by boxes with broken line edges. The number of individuals in each group is reported in Table SI2:3.2-I.
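The percentile logic behind panel B can be sketched as a generic gene-level bootstrap. This is not the paper's Rxy implementation: the statistic below is a plain ratio of summed per-gene counts, and the function name, the inputs and the choice of which group is the numerator are all assumptions for illustration. The exact Rxy definition is given in the SI and is not reproduced here.

```python
import random


def bootstrap_ratio_percentiles(num_counts, den_counts, n_boot=1000, seed=0):
    """Resample genes with replacement; return the (5th, 95th) percentiles
    of the ratio sum(num) / sum(den) across bootstrap replicates."""
    rng = random.Random(seed)
    n = len(num_counts)
    ratios = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap replicate
        num = sum(num_counts[i] for i in idx)
        den = sum(den_counts[i] for i in idx)  # assumed to be positive
        ratios.append(num / den)
    ratios.sort()
    # simple rank-based percentiles
    p5 = ratios[int(0.05 * (n_boot - 1))]
    p95 = ratios[int(0.95 * (n_boot - 1))]
    return p5, p95
```

Under the rule stated in the legend, a group would be called significant when the returned 95th percentile lies below the reference value of 1.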
ED7 Length of haplotypes assigned as African by fineSTRUCTURE as a function of genome proportion.
A: 447 Diversity Panel results, showing label averages (large crosses) along with individuals (small dots). B: Relative excluded Diversity Panel results, to check whether including related individuals affects African genome fraction. Individuals that shared more than 2% of genome fraction were forbidden from receiving haplotypes from each other, and the painting was re-run on a large subset of the genome (all ROH regions from any individual). C: ROH only African haplotypes. To guard against phasing errors, we analysed only regions for which an individual was in a long (>500kb) Run of Homozygosity using the PLINK command “--homozyg-window-kb 500000 --homozyg-window-het 0 --homozyg-density 10”. Because there are so few such regions, we report only the population average for populations with two or more individuals, as well as the standard error in that estimate. Populations for which the 95% CI passed 0 were also excluded. Note the logarithmic axis. D: Ancient DNA panel results. We used a different panel of 109 individuals which included 3 ancient genomes. We painted Chromosomes 11, 21 & 22 and report as crosses the population averages for populations with 2 or more individuals. The solid thin lines represent the position of each population when modern samples only are analysed. The dashed lines lead off the figure to the position of the ancient hominins and the African samples.
Figure 2 Evidence of an xOoA signature in the genomes of modern Papuans.
Panel A: MSMC split times plot. The Yoruba-Eurasia split curve shows the mean of all Eurasian genomes against one Yoruba genome. The grey area represents top and bottom 5% of runs. We chose a Koinanbe genome as representative of the Sahul populations. Panels B-D: Decomposition of Papuan haplotypes inferred as African by fineSTRUCTURE. Panel B: Semi-parametric decomposition of the joint distribution of haplotype lengths and non-African derived allele rate per SNP, showing the relative proportion of haplotypes in K=20 components of the distribution, ordered by non-African derived allele rate, relative to the overall proportion of haplotypes in each component. The four datasets produced by considering haplotypes inferred as (African/Denisova) in (Europeans/Papuans) are shown with our inferred "extra Out-of-Africa xOoA" component. Panel C: The properties of the components in terms of non-African derived allele rate, on which the components are ordered, and length.
Panel D: The reconstruction of haplotypes inferred as African in the genomes of Papuan individuals, using a mixture of all other data (red) and with the addition of the xOoA signature (black).
ED8 MSMC Linear behavior of MSMC split estimates in presence of admixture.
The examined Central Asian (A), East African (B), and African-American (C) genomes yielded a signature of MSMC split time (Truth, left-most column) that could be recapitulated (Reconstruction, second left-most column) as a linear mixture of other MSMC split times. The admixture proportions inferred by our method (top of each admixture component column) were remarkably similar to the ones previously reported in the literature.
MSMC split times (D) calculated after re-phasing an Estonian and a Papuan (Koinanbe) genome together with all the available West African and Pygmy genomes from our dataset to minimize putative phasing artefacts. The cross coalescence rate curves reported here are quantitatively comparable with the ones of Figure 2 A, hence showing that phasing artefacts are unlikely to explain the observed past-ward shift of the Papuan-African split time. Boxplot (E) showing the distribution of differences between African-Papuan and African-Eurasian split times obtained from coalescent simulations assembled through random replacement to make 2000 sets of 6 individuals (to match the 6 Papuans available from our empirical dataset), each made of 1.5 Gb of sequence. The simulation command line used to generate each chromosome made of 5Mb was as follows, being *DIV*=0.064; 0.4 or 0.8 for the xOoA, Denisova (Den) and Divergent Denisova (DeepDen) cases, respectively: ms0ancient2 10 1 .065 .05 -t 5000. -r 3000. 5000000 -I 7 1 1 1 1 2 2 2 -en 0. 1 .2 -en 0. 2 .2 -en 0. 3 .2 -en 0. 4 .2 -es .025 7 .96 -en .025 8 .2 -ej .03 7 6 -ej .04 6 5 -ej .060 8 3 -ej .061 4 3 -ej .062 2 1 -ej .063 3 1 -ej *DIV* 1 5
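The linear-mixture reconstruction of panels A-C can be illustrated for the simplest case of two source populations. The sketch assumes that each genome is summarised by a vector of split times against the same reference genomes, and that the admixed vector is approximated by a convex combination of the two source vectors. The least-squares fit and the function name are illustrative assumptions, not the authors' method.

```python
def mixture_proportion(target, source_a, source_b):
    """Least-squares alpha for target ~ alpha*source_a + (1-alpha)*source_b,
    clipped to the interval [0, 1]."""
    num = sum((t - b) * (a - b) for t, a, b in zip(target, source_a, source_b))
    den = sum((a - b) ** 2 for a, b in zip(source_a, source_b))
    if den == 0:
        raise ValueError("sources are identical; proportion is not identifiable")
    # For a single parameter, clipping the unconstrained optimum gives
    # the constrained least-squares solution.
    return min(1.0, max(0.0, num / den))


def reconstruct(alpha, source_a, source_b):
    """Split-time vector implied by the fitted mixture."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(source_a, source_b)]
```

With more than two sources the same idea becomes a least-squares problem with non-negative weights that sum to one, which would need a constrained solver rather than this closed form.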
ED9 Modelling the xOoA components with FineSTRUCTURE.
A: Joint distribution of haplotype lengths and Derived allele count, showing the median position of each cluster and all haplotypes assigned to it in the Maximum A Posteriori (MAP) estimate. Note that although a different proportion of points is assigned to each in the MAP, the total posterior is very close to 1/K for all. The dashed lines show a constant mutation rate. Haplotypes are ordered by mutation rate from low to high. B: Residual distribution comparison between the two component mixture using EUR.AFR and EUR.PNG (left), and the three component mixture including xOoA (using the same colour scale) (right). The residuals without xOoA are larger (RMSE 0.0055 compared to RMSE 0.0018) but more importantly, they are also structured. C: Assuming a mutational clock and a correct assignment of haplotypes, we can estimate the relative age of the splits from the number of derived alleles observed on the haplotypes. This leads to an estimate of 1.5 times older for xOoA compared to the Eurasian-Africa split.
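The mutational-clock argument in panel C reduces to a ratio of derived-allele rates. The sketch below assumes that each haplotype is summarised as a (length, derived allele count) pair and that split age is proportional to the pooled derived-allele rate. The function names and the input layout are hypothetical.

```python
def derived_rate(haplotypes):
    """Pooled derived-allele rate over (length, derived_count) pairs."""
    total_length = sum(length for length, _ in haplotypes)
    total_derived = sum(count for _, count in haplotypes)
    return total_derived / total_length


def relative_split_age(component, reference):
    """Age of one component relative to a reference, assuming a constant
    mutation rate and correctly assigned haplotypes."""
    return derived_rate(component) / derived_rate(reference)
```

A returned value of about 1.5 for the xOoA component against the Eurasian-African reference would correspond to the estimate reported in the legend.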
ED10 Proposed xOoA model.
A subway map figure illustrating, as suggested by the novel results presented here, a model of an early, extinct Out-of-Africa (xOoA) signature in the genomes of Sahul populations at their arrival in the region. Given the overall small genomic contribution of this event to the genomes of modern Sahul individuals, we could not determine whether the documented Denisova admixture (question marks) and putative multiple Neanderthal admixtures took place along this extinct OoA. We also speculate (question mark) that people who migrated along the xOoA route may have left a trace in the genomes of the Altai Neanderthal, as reported by Kuhlwilm and colleagues (ref. 12).