
1000 Genomes Selection Browser 1.0: a genome browser dedicated to signatures of natural selection in modern humans.

Marc Pybus, Giovanni M Dall'Olio, Pierre Luisi, Manu Uzkudun, Angel Carreño-Torres, Pavlos Pavlidis, Hafid Laayouni, Jaume Bertranpetit, Johannes Engelken.

Abstract

Searching for Darwinian selection in natural populations has been the focus of a multitude of studies over the last decades. Here we present the 1000 Genomes Selection Browser 1.0 (http://hsb.upf.edu) as a resource for signatures of recent natural selection in modern humans. We have implemented and applied a large number of neutrality tests as well as summary statistics informative for the action of selection such as Tajima's D, CLR, Fay and Wu's H, Fu and Li's F* and D*, XPEHH, ΔiHH, iHS, F(ST), ΔDAF and XPCLR among others to low coverage sequencing data from the 1000 genomes project (Phase 1; release April 2012). We have implemented a publicly available genome-wide browser to communicate the results from three different populations of West African, Northern European and East Asian ancestry (YRI, CEU, CHB). Information is provided in UCSC-style format to facilitate the integration with the rich UCSC browser tracks and an access page is provided with instructions and for convenient visualization. We believe that this expandable resource will facilitate the interpretation of signals of selection on different temporal, geographical and genomic scales.

Year: 2013 PMID: 24275494 PMCID: PMC3965045 DOI: 10.1093/nar/gkt1188

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

Initiatives such as the 1000 Genomes Project (1,2) are generating resequencing data from world-wide human populations on a genome-wide scale. Resequencing data constitute a major leap for population genomic analysis because of their higher information density and limited SNP ascertainment bias compared to genotyping data. Such data are therefore appropriate for calculating summary statistics that are based on the site frequency spectrum, like CLR or Tajima’s D. Using the neutral evolutionary model as a null hypothesis, diverse statistics can be applied to genetic data to identify deviations from neutrality (Table 1). These statistical tests show varying degrees of robustness to demographic events (e.g. population bottlenecks and expansions) and sensitivity to different types of selection (e.g. positive, purifying or balancing). For instance, population bottlenecks can lead to footprints that are similar to those caused by positive selection (21). Therefore, outlier approaches, which are commonly used to identify non-neutral loci in the extremes of a genome-wide distribution, are likely to include a number of false positives among their outliers. Likewise, a number of false negatives, i.e. truly selected loci that go undetected, are expected in a grey zone near the (arbitrary) outlier threshold (22). Outlier approaches in genome scans have proven powerful, but they should be interpreted carefully in order to avoid storytelling (23). Moreover, a profound understanding of adaptive evolution requires the integration of biological function (24) and, if possible, validation on an experimental basis (25). Molecular network approaches can also give a functional context to the specific genes under adaptive selection (26,27). In all studies, care should be taken in communicating putative loci under selection to the public in order to avoid racist misinterpretation (28).
Despite these limitations and the fact that complete selective sweeps may not be extremely widespread in humans (29), a large number of regions under strong positive selection can be expected in the genome (30).

Table 1.

List of available summary statistics

Method family	Method	Reference	Window size	Rank scores tail
Allele frequency spectrum	Tajima’s D	Tajima (3)	30 kb	Lower
	CLR	Nielsen et al. (4)	Variable size	Upper
	Fay and Wu’s H	Fay and Wu (5)	30 kb	Lower
	Fu and Li’s F*	Fu and Li (6)	30 kb	Lower
	Fu and Li’s D*	Fu and Li (6)	30 kb	Lower
	R2	Ramos-Onsins and Rozas (7)	30 kb	Lower
Linkage disequilibrium structure	XP-EHH	modified from Sabeti et al. (8)	SNP-specific	Upper
	ΔiHH	modified from Voight et al. (9)	SNP-specific	Upper
	iHS	modified from Voight et al. (9)	SNP-specific	Upper
	EHH_average	modified from Sabeti et al. (10)	30 kb	Upper
	EHH_max	modified from Sabeti et al. (10)	30 kb	Upper
	Wall’s B	Wall (11)	30 kb	Upper
	Wall’s Q	Wall (12)	30 kb	Upper
	Fu’s F	Fu (13)	30 kb	Lower
	Dh	Nei (14)	30 kb	Upper
	Za	Rozas et al. (15)	30 kb	Upper
	ZnS	Kelly (16)	30 kb	Upper
	ZZ	Rozas et al. (15)	30 kb	Upper
Population differentiation	Fst (global and pairwise)	Weir and Cockerham (17)	SNP-specific	Upper
	ΔDAF (standard and absolute)	Hofer et al. (18)	SNP-specific	Upper
	XP-CLR	Chen et al. (19)	0.1 cM (maximum window)	Upper
Descriptive statistics	Segregating sites		30 kb	NA
	Singletons		30 kb	NA
	pi (nucleotide diversity)	Nei and Li (20)	30 kb	NA
	DAF (derived allele frequency)		SNP-specific	NA
	MAF (minor allele frequency)		SNP-specific	NA

DESCRIPTION OF APPLIED STATISTICAL TESTS

Due to linkage, neutral alleles in the region surrounding a positively selected allele hitchhike with it. Maynard Smith and Haigh (31) described this process of genetic hitchhiking and the so-called selective sweep. More recent studies showed that genetic hitchhiking generates distinct polymorphism signatures in the genome, such as: (i) a reduction of the polymorphism level and an excess of low- and high-frequency derived variants (32), (ii) spatial patterns of linkage disequilibrium (33) and (iii) increased genetic differentiation among populations (34). Taking advantage of these three theoretical expectations, several methods to detect positive selection have been developed in the last two decades. This variety reflects the fact that no single statistic is sufficient to detect selection under all demographic models and modes of selection (22). Here, we implemented a large number of statistical tests (Table 1) in order to allow for a more comprehensive analysis of natural selection, especially positive selection. In brief, we have assigned the statistical tests to different method families (Table 1). Within the first family, which is based on the allele frequency spectrum, Tajima’s D (3) is a classical neutrality test that compares estimates of the number of segregating sites and the mean pair-wise difference between sequences. CLR is a multi-locus composite likelihood ratio test (4,35). Fay and Wu’s H (5) uses another facet of the site frequency spectrum by comparing the number of derived segregating sites at high frequencies to the number of variants at intermediate frequencies. Fu and Li’s F* compares the number of singletons to the mean pair-wise difference between sequences, and Fu and Li’s D* compares it to the total number of nucleotide variants in a genomic region (6). R2 (7) is a statistical test for detecting population growth based on the difference between the number of singletons per sequence and the average number of nucleotide differences.
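As an illustration of the first method family, the following is a minimal sketch of Tajima’s D for a single window, following the textbook formulation in Tajima (3). It is not the implementation used for the browser (which is described in the Supplementary Material); the function name and the input layout (one allele count per segregating site) are assumptions made for this example.

```python
from math import sqrt


def tajimas_d(n, allele_counts):
    """Tajima's D for one window (textbook formulation, illustrative only).

    n             -- number of sampled chromosomes
    allele_counts -- for each segregating site, the count (1..n-1) of one
                     of its two alleles among the n chromosomes
    Returns None if the window has no segregating sites or the variance
    term is not positive.
    """
    s = len(allele_counts)
    if s == 0 or n < 2:
        return None
    # Mean pairwise difference between sequences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    # Constants from Tajima's variance derivation.
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    # Difference between the two diversity estimates, scaled by its
    # expected standard deviation under neutrality.
    return (pi - s / a1) / sqrt(variance)


# A window dominated by singletons yields a negative D,
# one dominated by intermediate-frequency variants a positive D.
d_rare = tajimas_d(10, [1, 1, 1, 1, 1])
d_intermediate = tajimas_d(10, [5, 5, 5, 5, 5])
```

The sign behaviour matches the rank tails in Table 1: an excess of rare variants, as expected after a sweep, pushes D towards the lower tail, whereas an excess of intermediate-frequency variants, as expected under balancing selection, pushes it upwards.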
Among the linkage disequilibrium structure methods, XP-EHH (8) is a cross-population test based on extended haplotype homozygosity (EHH). ΔiHH considers the difference between the integrated haplotype homozygosity scores for each allele in a single population, while iHS (9) is defined as their log ratio. EHH average and EHH maximum (36; modified from (10)) are based on the extended haplotype homozygosity. Wall’s B (11) counts the number of pairs of adjacent segregating sites that are congruent (i.e. the subset of the data consisting of the two sites contains only two different haplotypes), while Wall’s Q (12) adds the number of partitions (two disjoint subsets whose union is the set of individuals in the sample) induced by congruent pairs to Wall’s B. Fu’s F (13) takes into account the haplotype diversity in the sample. Dh (14) is a summary statistic based on the number of different haplotypes in the sample. The third family of methods is based on population differentiation. FST (37; calculated following the diploid method in Weir 1996, p. 178) and ΔDAF (18) are estimates of population differentiation based on derived allele frequencies. XP-CLR (19) is a multi-locus allele-frequency-differentiation statistic between two populations. Additional statistics, such as the number of segregating sites per 30-kb window and the nucleotide diversity (Table 1), are listed as descriptive statistics. A thorough description of the tests is given in the original literature (see Table 1) and in several excellent reviews on the topic (38,39).
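The relation between EHH, ΔiHH and iHS can be sketched as follows. This is a simplified illustration of the textbook definitions, not the modified versions listed in Table 1: haplotypes are assumed to be lists of 0/1 alleles (0 = ancestral, 1 = derived), the 0.05 EHH cutoff, the trapezoid integration over physical position, the sign convention of the difference and the absence of any standardization are all assumptions of this example.

```python
from collections import Counter
from math import comb, log


def ehh(haplotypes, core, site):
    """Probability that two randomly chosen haplotypes are identical
    at every site between the core index and another site index."""
    lo, hi = sorted((core, site))
    n = len(haplotypes)
    if n < 2:
        return 0.0
    classes = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(comb(c, 2) for c in classes.values()) / comb(n, 2)


def ihh(haplotypes, positions, core, cutoff=0.05):
    """Integrate EHH over position in both directions from the core,
    stopping when EHH falls below the cutoff or the data end."""
    total = 0.0
    for step in (-1, 1):
        prev_ehh, prev_pos = 1.0, positions[core]
        i = core + step
        while 0 <= i < len(positions):
            cur = ehh(haplotypes, core, i)
            if cur < cutoff:
                break
            total += 0.5 * (prev_ehh + cur) * abs(positions[i] - prev_pos)
            prev_ehh, prev_pos = cur, positions[i]
            i += step
    return total


def ihh_by_allele(haplotypes, positions, core, cutoff=0.05):
    """Return (iHH of ancestral carriers, iHH of derived carriers)."""
    anc = [h for h in haplotypes if h[core] == 0]
    der = [h for h in haplotypes if h[core] == 1]
    return (ihh(anc, positions, core, cutoff),
            ihh(der, positions, core, cutoff))


def unstandardized_ihs(ihh_anc, ihh_der):
    """Log ratio of the two iHH values; None if either is not positive
    (e.g. when one allele is carried by fewer than two haplotypes)."""
    if ihh_anc <= 0 or ihh_der <= 0:
        return None
    return log(ihh_anc / ihh_der)


# Toy data: three identical derived haplotypes, three diverse ancestral ones.
positions = [0, 1000, 2000, 3000, 4000]
haps = [[0, 0, 1, 0, 0]] * 3 + [[0, 0, 0, 0, 0],
                                [1, 0, 0, 1, 0],
                                [0, 1, 0, 0, 1]]
ihh_anc, ihh_der = ihh_by_allele(haps, positions, core=2)
delta_ihh = ihh_der - ihh_anc
ihs = unstandardized_ihs(ihh_anc, ihh_der)
```

In the toy data the derived allele sits on a long, unbroken haplotype, so its iHH exceeds that of the ancestral allele and the log ratio is negative.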

COMPUTATIONAL FRAMEWORK AND DESCRIPTION OF 1000 GENOMES SOURCE DATA

A framework to calculate diverse summary statistics (Table 1) from 1000 Genomes data was developed (Figure 1). A detailed description of how the statistics were implemented is given in the Supplementary Material, and a genome-wide overview of the results stored in the database for selected summary statistics is given in Supplementary Table S1. As described in the 1000 Genomes Phase 1 paper (1), the quality of the 1000 Genomes low-coverage data has improved considerably over the pilot phase (2), but a number of limitations need to be kept in mind for population genomic analysis: (i) singletons and other rare variants are still underrepresented, (ii) only ∼94% of the genome is accessible with the short-read sequencing technologies used and (iii) the reported phasing switch error every 250 kb (median, Supplementary Figure S5 in (1)) likely leads to an underestimate of the length of the long shared haplotypes expected to occur around recent selective sweeps. Despite these drawbacks, which are mainly due to the nature of the low-coverage approach, the short-read technology and differences in read depth (40), this dataset has important advantages over genotyping data, most importantly (i) a higher SNP density, (ii) the overcoming of ascertainment bias and (iii) a larger number of individuals per population when compared to previous datasets (HapMap II and HGDP). We used phased data from the CEU, CHB and YRI populations from the integrated Phase 1 variant set (April 2012), with 97, 85 and 88 individuals, respectively. From the input VCF (variant call format) file we extracted exclusively the low-coverage VQSR SNP calls in order to avoid any bias that might result from differences between low-coverage calls and high-coverage exome SNP calls. Indels were not used.
Ancestral states in this data set were identified using a 4-way alignment of human, chimpanzee, orangutan and rhesus macaque, provided by the 1000 Genomes consortium (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/supporting/ancestral_alignments/).
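To illustrate how an inferred ancestral state feeds into the SNP-specific statistics DAF and ΔDAF (Table 1), the sketch below polarizes a biallelic SNP and contrasts two populations. The function names, the argument layout and the example counts are hypothetical; ignoring the case of the ancestral allele and returning None for unresolved sites are assumptions of this example, and which populations are contrasted in the browser tracks is not specified here.

```python
def derived_allele_frequency(ref, alt, ancestral, alt_count, n_chromosomes):
    """Derived allele frequency of a biallelic SNP.

    Returns None when the ancestral state matches neither allele
    (e.g. it is missing or ambiguous), since the site cannot be polarized.
    """
    if n_chromosomes <= 0:
        return None
    anc = ancestral.upper()
    alt_freq = alt_count / n_chromosomes
    if anc == ref.upper():
        return alt_freq          # alternative allele is derived
    if anc == alt.upper():
        return 1.0 - alt_freq    # reference allele is derived
    return None


def delta_daf(daf_pop, daf_other, absolute=False):
    """Difference in derived allele frequency between two populations."""
    if daf_pop is None or daf_other is None:
        return None
    diff = daf_pop - daf_other
    return abs(diff) if absolute else diff


# Hypothetical SNP: the reference allele is derived in this example.
daf_a = derived_allele_frequency("A", "G", "G", alt_count=50, n_chromosomes=200)
daf_b = derived_allele_frequency("A", "G", "G", alt_count=150, n_chromosomes=200)
example_delta = delta_daf(daf_a, daf_b)
```

Sites that cannot be polarized still have a defined minor allele frequency, which is one reason to report MAF alongside DAF.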

Figure 1.

Schematic workflow developed in order to calculate diverse genome-wide summary statistics informative for the action of selection and to build a database in order to share and visualize the results.

AVAILABILITY OF DATABASE

All data are available via our entry page: http://hsb.upf.edu. A search mask gives the user easy access to the results for a specific gene or a genomic region of choice. The ‘submit’ button leads the user to a UCSC-style genome browser (http://pgb.ibe.upf.edu/), which is a custom installation of the UCSC Genome Browser (41,42). This installation allows for a visual inspection of the data and for an integration of our data with many other available datasets. The raw scores of the tracks can be conveniently downloaded using the UCSC Table function (43), which is integrated with the Galaxy platform (galaxyproject.org). Using the ‘configure’ function on the browser page, the tracks can be further customized, and using ‘right click’ the visualized genomic regions can be downloaded as a picture in .png format. For every statistical test, we provide two tracks, one for the raw scores and one for the ranked scores. The purpose of the rank score tracks is to provide a comparison to the rest of the genome. Conveniently, the rank scores are presented in such a way that they show a peak (instead of a valley) in regions under positive selection. They are calculated using an outlier approach (22,44): all scores are sorted genome-wide and each score is assigned the −log10 of its rank divided by the number of values in the distribution, ranking from the upper tail for most of the tests and from the lower tail for Tajima’s D, Fay and Wu’s H, Fu and Li’s F* and D*, R2 and Fu’s F (see Table 1 and a more detailed description on the entry page). The main purpose of the entry page is to provide a channel of communication with users, following the guidelines in (45). It serves as a platform for updates, questions and feedback (46). The page therefore also provides documentation on the tracks and on the tests implemented, as well as a FAQ and a feedback section.
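The rank score transformation described above can be sketched as follows. This is a minimal illustration and not the code used to build the tracks; the function name, the `tail` parameter and the arbitrary handling of tied scores are assumptions of this example.

```python
import math


def rank_scores(scores, tail="upper"):
    """Convert raw scores to -log10(rank / n) empirical rank scores.

    tail="upper": the largest raw score gets rank 1 (most extreme).
    tail="lower": the smallest raw score gets rank 1, as for statistics
                  whose signal of positive selection lies in the lower tail.
    Ties are broken by input order, which is an arbitrary choice here.
    """
    if tail not in ("upper", "lower"):
        raise ValueError("tail must be 'upper' or 'lower'")
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i],
                   reverse=(tail == "upper"))
    ranked = [0.0] * n
    for rank, i in enumerate(order, start=1):
        ranked[i] = -math.log10(rank / n)
    return ranked


raw = [0.1, 5.0, 2.0, -3.0]
upper = rank_scores(raw, tail="upper")   # 5.0 becomes the peak
lower = rank_scores(raw, tail="lower")   # -3.0 becomes the peak
```

With this transformation the most extreme value among n scores receives log10(n) and the least extreme receives 0, so peaks in different tracks are comparable regardless of the scale or tail of the underlying statistic.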

EXAMPLE APPLICATIONS

First, we exemplify the use of the database by extracting results for a number of established loci under selection: EDAR (47), LCT (46), SLC45A2 (48), CD36 (49), HERC2 (50), SLC24A5 (51), CD5 (52) and APOL1 (53). A locus-specific summary of the statistical tests is given in Supplementary Table S2. Interestingly, for any given locus, only a subset of the statistical tests shows an extreme outlier score. This is consistent with differences in the architecture of selective sweeps. iHS scores near certain very pronounced selective sweeps (e.g. LCT and SLC24A5) could not be computed due to inherent properties of the statistic, because either (i) the selected haplotype was near fixation or (ii) the EHH did not drop below the defined threshold in a given window. Examples of both positive (SLC45A2) and balancing (HLA region) selection are visualized in Figure 2. As expected, Tajima’s D scores around HLA (54) as well as the ABO locus (55) (data not shown) were markedly elevated in all three analyzed populations, a pattern which is compatible with the action of balancing selection.

Figure 2.

Examples of genomic regions under selection in the 1000 genomes selection browser. Tracks of statistics from different populations are visualized in colour (CEU in green, CHB in red and YRI in blue). Additional examples are given at http://hsb.upf.edu. (A) The p- and q-arms of chromosome 2 (−log10 of empirically ranked scores). Recurrent peaks at around 72.5 Mb (left green arrow) and 109.5 Mb (right green arrow) indicate the loci CYP26B1/EXOC6B and EDAR, respectively. (B) Signature of positive selection around SLC45A2, another established skin colour gene, in the CEU population (0.5-Mb window; −log10 of empirically ranked scores). (C) Widespread balancing selection in the HLA region indicated by strongly positive scores for Tajima’s D in all three analysed human populations (0.5-Mb window).

COMPARISON TO OTHER WEB RESOURCES

For positive selection based on between-species comparisons, the Selectome database (http://bioinfo.unil.ch/selectome/; (56)) presents results based on the dN/dS method using a branch-site specific likelihood test. For recent natural selection within modern humans, a number of web resources are available. For previous datasets, e.g. the HapMap 2 and HGDP projects, several positive selection statistics are available in the form of the haplotter tool (http://haplotter.uchicago.edu/; (24)) and the HGDP selection browser (http://hgdp.uchicago.edu/; (57)). For the 1000 Genomes Project data, the online tool ENGINES (http://spsmart.cesga.es; (58)) is useful for the analysis of allele frequencies, and a recent study presented a method to calculate corrected summary statistics from low-coverage sequencing data (40). dbPSHP (http://jjwanglab.org/dbpshp) offers a large number of statistical tests in a SNP-specific manner for the HapMap 3 and 1000 Genomes datasets. Complementary to these databases, our database gives a large number of region- and SNP-specific scores (depending on the test statistic) based on resequencing data (1000 Genomes Phase 1), with a special focus on genome-wide significance (through the ranked scores) and the visualization of several statistics in parallel (Figure 2).

CONCLUSIONS

By applying a large number of summary statistics to data from the 1000 Genomes Project, we have built a timely and expandable resource for the population genomics research community. An associated user-friendly genome browser gives a visual impression of the genetic variation in a genomic region of interest and offers functionality for an array of downstream analyses. While this resource will not replace a thorough, case-by-case analysis of selection, we expect that it will prove useful for the research community through the large number of test statistics and the fine-grained character of resequencing data.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

54 in total

Review 1. Molecular spandrels: tests of adaptation at the genetic level.

Authors: Rowan D H Barrett; Hopi E Hoekstra
Journal: Nat Rev Genet Date: 2011-10-18 Impact factor: 53.242

2. Association of trypanolytic ApoL1 variants with kidney disease in African Americans.

Authors: Giulio Genovese; David J Friedman; Michael D Ross; Laurence Lecordier; Pierrick Uzureau; Barry I Freedman; Donald W Bowden; Carl D Langefeld; Taras K Oleksyk; Andrea L Uscinski Knob; Andrea J Bernhardy; Pamela J Hicks; George W Nelson; Benoit Vanhollebeke; Cheryl A Winkler; Jeffrey B Kopp; Etienne Pays; Martin R Pollak
Journal: Science Date: 2010-07-15 Impact factor: 47.728

3. ENGINES: exploring single nucleotide variation in entire human genomes.

Authors: Jorge Amigo; Antonio Salas; Christopher Phillips
Journal: BMC Bioinformatics Date: 2011-04-19 Impact factor: 3.169

4. Classic selective sweeps were rare in recent human evolution.

Authors: Ryan D Hernandez; Joanna L Kelley; Eyal Elyashiv; S Cord Melton; Adam Auton; Gilean McVean; Guy Sella; Molly Przeworski
Journal: Science Date: 2011-02-18 Impact factor: 47.728

5. Evolutionary and functional evidence for positive selection at the human CD5 immune receptor gene.

Authors: Elena Carnero-Montoro; Lizette Bonet; Johannes Engelken; Torsten Bielig; Mario Martínez-Florensa; Francisco Lozano; Elena Bosch
Journal: Mol Biol Evol Date: 2011-10-13 Impact factor: 16.240

6. A map of human genome variation from population-scale sequencing.

Authors: Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal: Nature Date: 2010-10-28 Impact factor: 49.962

7. Constructing genomic maps of positive selection in humans: where do we go from here?

Authors: Joshua M Akey
Journal: Genome Res Date: 2009-05 Impact factor: 9.043

8. Ten simple rules for getting help from online scientific communities.

Authors: Giovanni M Dall'Olio; Jacopo Marino; Michael Schubert; Kevin L Keys; Melanie I Stefan; Colin S Gillespie; Pierre Poulain; Khader Shameer; Robert Sugar; Brandon M Invergo; Lars J Jensen; Jaume Bertranpetit; Hafid Laayouni
Journal: PLoS Comput Biol Date: 2011-09-29 Impact factor: 4.475

9. The UCSC Genome Browser database: extensions and updates 2011.

Authors: Timothy R Dreszer; Donna Karolchik; Ann S Zweig; Angie S Hinrichs; Brian J Raney; Robert M Kuhn; Laurence R Meyer; Mathew Wong; Cricket A Sloan; Kate R Rosenbloom; Greg Roe; Brooke Rhead; Andy Pohl; Venkat S Malladi; Chin H Li; Katrina Learned; Vanessa Kirkup; Fan Hsu; Rachel A Harte; Luvina Guruvadoo; Mary Goldman; Belinda M Giardine; Pauline A Fujita; Mark Diekhans; Melissa S Cline; Hiram Clawson; Galt P Barber; David Haussler; W James Kent
Journal: Nucleic Acids Res Date: 2011-11-15 Impact factor: 16.971

10. Positive selection of a CD36 nonsense variant in sub-Saharan Africa, but no association with severe malaria phenotypes.

Authors: Andrew E Fry; Anita Ghansa; Kerrin S Small; Alejandro Palma; Sarah Auburn; Mahamadou Diakite; Angela Green; Susana Campino; Yik Y Teo; Taane G Clark; Anna E Jeffreys; Jonathan Wilson; Muminatou Jallow; Fatou Sisay-Joof; Margaret Pinder; Michael J Griffiths; Norbert Peshu; Thomas N Williams; Charles R Newton; Kevin Marsh; Malcolm E Molyneux; Terrie E Taylor; Kwadwo A Koram; Abraham R Oduro; William O Rogers; Kirk A Rockett; Pardis C Sabeti; Dominic P Kwiatkowski
Journal: Hum Mol Genet Date: 2009-04-29 Impact factor: 6.150
