hapbin: An Efficient Program for Performing Haplotype-Based Scans for Positive Selection in Large Genomic Datasets.

Colin A Maclean¹, Neil P Chue Hong¹, James G D Prendergast².

Abstract

Understanding how the genome is shaped by selective processes forms an integral part of modern biology. However, as genomic datasets continue to grow larger, it is becoming increasingly difficult to apply traditional statistics for detecting signatures of selection to these cohorts. There is therefore a pressing need for the next generation of computational and analytical tools for detecting signatures of selection in large genomic datasets. Here, we present hapbin, an efficient multithreaded implementation of extended haplotype homozygosity-based statistics for detecting selection, which is up to 3,400 times faster than the current fastest implementations of these algorithms.

Keywords: EHH; XP-EHH; iHS; selection; software

Year: 2015 PMID: 26248562 PMCID: PMC4651233 DOI: 10.1093/molbev/msv172

Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240

As a selected allele sweeps through a population, the haplotype on which it resides increases in frequency faster than recombination can break it down. As a result, alleles under positive selection are expected to reside on unusually long haplotypes given their frequency, and such extended haplotype homozygosity (EHH) (Sabeti et al. 2002) forms the basis of a number of the most popular tests of selection, including the integrated haplotype score (iHS) (Voight et al. 2006) and the cross-population EHH (XP-EHH) statistic (Sabeti et al. 2007). These haplotype-based methods of detecting selection have a number of advantages over other tests, for example their ability to detect partial or incomplete sweeps (Vitti et al. 2013) and short-term balancing selection (Vitti et al. 2013), and their comparative insensitivity to background selection (reduced neutral variation as a result of purifying selection at linked deleterious sites), which can otherwise confound studies of adaptive evolution (Enard et al. 2014). However, with sequencing costs falling faster than computational speeds are increasing (Check Hayden 2014), it is becoming increasingly difficult to apply these statistics to contemporary cohorts as genomic datasets grow larger. Recently, an improved implementation of these statistics was made available within the selscan program (Szpiech and Hernandez 2014), which was demonstrated to be two times faster at calculating iHS than the next fastest algorithm, rehh (Gautier and Vitalis 2012). However, even with this improved implementation, the calculation of iHS across 100 whole human genomes, the approximate average size of a 1000 Genomes Project (1000 Genomes Project Consortium et al. 2010) population cohort, is still expected to take over 2 months when run on a single core on a standard desktop machine.
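To make the EHH idea concrete, the following is a minimal Python sketch of the standard definition: among chromosomes carrying a given allele at a core site, EHH is the fraction of chromosome pairs that are identical across the whole interval from the core to a more distant site. The function name `ehh`, the list-of-lists input layout, and the toy data are assumptions made for illustration; this is not hapbin's algorithm or its input format.

```python
from collections import Counter
from math import comb


def ehh(haplotypes, core, allele, end):
    """EHH of `allele` at site `core`, extended out to site `end` (inclusive).

    haplotypes: list of equal-length sequences of 0/1 alleles, one per chromosome.
    Returns the probability that two distinct chromosomes carrying `allele`
    at `core` are identical at every site between `core` and `end`.
    """
    lo, hi = min(core, end), max(core, end)
    # Keep only chromosomes carrying the allele, reduced to the interval of interest.
    carriers = [tuple(h[lo:hi + 1]) for h in haplotypes if h[core] == allele]
    n = len(carriers)
    if n < 2:
        return 0.0
    # Pairs of identical haplotypes divided by all possible pairs.
    counts = Counter(carriers)
    same_pairs = sum(comb(c, 2) for c in counts.values())
    return same_pairs / comb(n, 2)


# Hypothetical toy data: six chromosomes, five sites; the core site is index 2.
HAPS = [
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1],
]

if __name__ == "__main__":
    for end in (2, 3, 4):
        print(end, ehh(HAPS, core=2, allele=1, end=end))
```

EHH equals 1 at the core itself and can only stay level or fall as the interval widens, which is why long haplotypes around a frequent allele show up as a slowly decaying curve.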
For these algorithms to be widely used, new, faster, and more efficient computational approaches are required to improve the speed at which EHH-based selection scans can be carried out, so that whole-genome sequencing datasets of ever increasing size can be processed in reasonable timeframes and on non-specialist hardware. Here, we introduce hapbin, which uses a new computational approach (see Supplementary methods, Supplementary Material online) to calculate the EHH, iHS, and XP-EHH statistics. We show that this implementation is up to 3,400 times faster than even selscan, allowing iHS to be calculated across 100 fully sequenced human genomes in ∼3 h, as opposed to over 2 months, when run on a single core on a standard desktop machine. To assess the performance of hapbin, it was first benchmarked alongside selscan on two different hardware architectures: an Amazon c3.8xlarge EC2 Ubuntu instance (32 CPUs) and ARCHER, the UK National Supercomputer. Importantly, hapbin will equally run on a standard desktop machine, but the use of these resources allowed us to assess its scalability while also enabling other users to repeat these analyses. Performance was characterized using various subsets of data from chromosome 22 of the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2010) cohort (phase 1 version 3), and both programs were run with default parameters (an EHH decay cutoff of 0.05 and a minimum minor allele frequency of 5%). As shown in figures 1A–C, Supplementary figure S1 and table S1, Supplementary Material online, hapbin is from 88 to 3,400 times quicker than selscan at calculating iHS, depending on the hardware used and the number of variants and individuals in the input dataset. With an input cohort of 50 individuals, hapbin processed all 489,301 genetic variants on chromosome 22 in 37 s when run on one core on ARCHER. In comparison, selscan took 35 h.
As shown in figure 1D, this speedup comes with no loss of accuracy.
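The two default parameters mentioned above can be illustrated with a short Python sketch. It assumes the common formulation in which EHH is integrated outward from the core site by the trapezoidal rule until it decays below the cutoff, and unstandardized iHS is the log ratio of the two alleles' integrated EHH. The function names, the handling of the final truncated segment, and the choice of which allele goes in the numerator are illustrative assumptions; implementations differ on these details, and this is not the hapbin or selscan code.

```python
from math import log


def integrate_ehh(distances, ehh_values, cutoff=0.05):
    """Trapezoidal area under an EHH curve moving away from the core site.

    distances: increasing distances from the core (first entry is 0, the core itself).
    ehh_values: EHH at each of those distances (first entry is 1.0).
    Integration stops at the first point whose EHH is below `cutoff`;
    the segment leading to that point is not counted (a simplification).
    """
    area = 0.0
    for i in range(1, len(distances)):
        if ehh_values[i] < cutoff:
            break
        width = distances[i] - distances[i - 1]
        area += 0.5 * (ehh_values[i - 1] + ehh_values[i]) * width
    return area


def unstandardized_ihs(ihh_first, ihh_second):
    """Log ratio of the two alleles' integrated EHH (numerator choice is a convention)."""
    return log(ihh_first / ihh_second)


def passes_maf(allele_count, n_chromosomes, min_maf=0.05):
    """True if the minor allele frequency at a site is at least `min_maf`."""
    freq = allele_count / n_chromosomes
    return min(freq, 1.0 - freq) >= min_maf


if __name__ == "__main__":
    # Hypothetical EHH curve on one side of a core site.
    print(integrate_ehh([0, 1, 2], [1.0, 0.5, 0.04]))
    print(passes_maf(3, 100), passes_maf(50, 100))
```

In a full scan the integration would be run in both directions from the core and the two areas summed for each allele before taking the log ratio.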

Fig. 1. Hapbin versus selscan comparisons. (A) Time taken by hapbin and selscan to calculate iHS across chromosome 22 across 48 cores (1 node) on ARCHER and on an Amazon c3.8xlarge instance. Subsets of individuals were randomly sampled from the 1000 Genomes dataset. (B) Time taken by hapbin and selscan to calculate iHS in the 1000 Genomes GBR (Great Britain) population of 89 individuals on the Amazon c3.8xlarge instance. Runs of contiguous SNPs by location were subsampled from all of those on chromosome 22. (C) Hapbin's relative speedup versus selscan when run across chromosome 22 with varying numbers of cores and individuals on ARCHER. (D) Comparison of unstandardized iHS values output by both selscan and hapbin when run across 500 randomly selected individuals and all SNPs on chromosome 22.

A further advantage of hapbin is that while selscan requires a further program to produce standardized iHS from the unstandardized values it outputs, hapbin produces both by default. As a result, all selscan timings presented here are the times taken to calculate unstandardized iHS only, while those for hapbin are for the calculation of both standardized and unstandardized values. Hapbin's relative speedups at calculating XP-EHH with respect to selscan are more modest than those observed when calculating iHS, but order-of-magnitude speedups are still observed (Supplementary fig. S1, Supplementary Material online).

The hapbin program can be downloaded from https://github.com/evotools/hapbin (last accessed August 10, 2015). Hapbin can be applied to datasets from any meiotically recombinant species and takes phased genotypes in IMPUTE format (Howie et al. 2009), as produced by the popular phasing algorithm SHAPEIT2 (O'Connell et al. 2014). To accompany the program, we have also exploited the speed of hapbin to calculate iHS genome-wide for all 26 populations in the recently released, phased, 1000 Genomes phase 3 cohort (Delaneau et al. 2014). These can be downloaded from http://dx.doi.org/10.7488/ds/214 (last accessed August 10, 2015) or viewed at the UCSC genome browser (Supplementary figs. S2 and S3, Supplementary Material online).
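The difference between unstandardized and standardized iHS can be sketched as follows. The usual approach is to group sites into allele-frequency bins and convert each unstandardized score to a z-score using the mean and standard deviation of its own bin. The function name `standardize_by_frequency`, the default number of bins, and the use of the population standard deviation are assumptions for illustration only; the text does not state how hapbin bins or scales its scores.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_frequency(scores, freqs, n_bins=50):
    """Standardize scores within allele-frequency bins.

    scores: unstandardized iHS values, one per site.
    freqs: allele frequency (0..1) of each site, in the same order as `scores`.
    Returns z-scores computed against the mean and standard deviation of the
    sites sharing a frequency bin; bins with zero spread yield 0.0.
    """
    def bin_of(f):
        # Equal-width bins over [0, 1]; a frequency of exactly 1.0 goes in the last bin.
        return min(int(f * n_bins), n_bins - 1)

    groups = defaultdict(list)
    for s, f in zip(scores, freqs):
        groups[bin_of(f)].append(s)
    stats = {b: (mean(v), pstdev(v)) for b, v in groups.items()}

    out = []
    for s, f in zip(scores, freqs):
        mu, sd = stats[bin_of(f)]
        out.append((s - mu) / sd if sd > 0 else 0.0)
    return out


if __name__ == "__main__":
    # Hypothetical scores at four sites falling into two frequency bins.
    print(standardize_by_frequency([1.0, 3.0, 10.0, 14.0],
                                   [0.12, 0.18, 0.52, 0.58], n_bins=10))
```

Binning by frequency matters because the expected haplotype length around an allele depends on its frequency, so raw scores from sites at very different frequencies are not directly comparable.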

Supplementary Material

Supplementary methods, figures S1–S3, and table S1 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

References

1. Detecting recent positive selection in the human genome from haplotype structure.

Authors: Pardis C Sabeti; David E Reich; John M Higgins; Haninah Z P Levine; Daniel J Richter; Stephen F Schaffner; Stacey B Gabriel; Jill V Platko; Nick J Patterson; Gavin J McDonald; Hans C Ackerman; Sarah J Campbell; David Altshuler; Richard Cooper; Dominic Kwiatkowski; Ryk Ward; Eric S Lander
Journal: Nature Date: 2002-10-09 Impact factor: 49.962

2. rehh: an R package to detect footprints of selection in genome-wide SNP data from haplotype structure.

Authors: Mathieu Gautier; Renaud Vitalis
Journal: Bioinformatics Date: 2012-03-07 Impact factor: 6.937

3. Technology: The $1,000 genome.

Authors: Erika Check Hayden
Journal: Nature Date: 2014-03-20 Impact factor: 49.962

4. A map of human genome variation from population-scale sequencing.

Authors: Gonçalo R Abecasis; David Altshuler; Adam Auton; Lisa D Brooks; Richard M Durbin; Richard A Gibbs; Matt E Hurles; Gil A McVean
Journal: Nature Date: 2010-10-28 Impact factor: 49.962

5. Genome-wide detection and characterization of positive selection in human populations.

Authors: Pardis C Sabeti; Patrick Varilly; Ben Fry; Jason Lohmueller; Elizabeth Hostetter; Chris Cotsapas; Xiaohui Xie; Elizabeth H Byrne; Steven A McCarroll; Rachelle Gaudet; Stephen F Schaffner; Eric S Lander; et al.
Journal: Nature Date: 2007-10-18 Impact factor: 49.962

6. Genome-wide signals of positive selection in human evolution.

Authors: David Enard; Philipp W Messer; Dmitri A Petrov
Journal: Genome Res Date: 2014-03-11 Impact factor: 9.043

7. Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel.

Authors: Olivier Delaneau; Jonathan Marchini
Journal: Nat Commun Date: 2014-06-13 Impact factor: 14.919

8. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies.

Authors: Bryan N Howie; Peter Donnelly; Jonathan Marchini
Journal: PLoS Genet Date: 2009-06-19 Impact factor: 5.917

9. A general approach for haplotype phasing across the full spectrum of relatedness.

Authors: Jared O'Connell; Deepti Gurdasani; Olivier Delaneau; Nicola Pirastu; Sheila Ulivi; Massimiliano Cocca; Michela Traglia; Jie Huang; Jennifer E Huffman; Igor Rudan; Ruth McQuillan; Ross M Fraser; Harry Campbell; Ozren Polasek; Gershim Asiki; Kenneth Ekoru; Caroline Hayward; Alan F Wright; Veronique Vitart; Pau Navarro; Jean-Francois Zagury; James F Wilson; Daniela Toniolo; Paolo Gasparini; Nicole Soranzo; Manjinder S Sandhu; Jonathan Marchini
Journal: PLoS Genet Date: 2014-04-17 Impact factor: 5.917

10. selscan: an efficient multithreaded program to perform EHH-based scans for positive selection.

Authors: Zachary A Szpiech; Ryan D Hernandez
Journal: Mol Biol Evol Date: 2014-07-10 Impact factor: 16.240
