SUMMARY: Sequencing pooled DNA samples (Pool-Seq) is the most cost-effective approach for the genome-wide comparison of population samples. Here, we introduce PoPoolation2, the first software tool specifically designed for the comparison of populations with Pool-Seq data. PoPoolation2 implements a range of commonly used measures of differentiation (FST, Fisher's exact test and Cochran-Mantel-Haenszel test) that can be applied on different scales (windows, genes, exons, SNPs). The results may be visualized with the widely used Integrative Genomics Viewer.
AVAILABILITY AND IMPLEMENTATION: PoPoolation2 is implemented in Perl and R. It is freely available at http://code.google.com/p/popoolation2/
CONTACT: christian.schloetterer@vetmeduni.ac.at
SUPPLEMENTARY INFORMATION: Manual: http://code.google.com/p/popoolation2/wiki/Manual Test data and tutorial: http://code.google.com/p/popoolation2/wiki/Tutorial Validation: http://code.google.com/p/popoolation2/wiki/Validation
1 INTRODUCTION
Next-generation sequencing of pooled DNA samples (Pool-Seq) allows the comparison of population samples on a genomic scale, thus facilitating the transition from single marker studies to population genomics. Due to its cost-effectiveness (Futschik and Schlötterer, 2010), Pool-Seq can be used for a range of applications. The most intuitive application is the comparison of natural populations to perform standard population genetic analyses on a genomic scale (e.g. Begun ). The comparison of natural Arabidopsis lyrata populations from different habitats allowed the characterization of genes involved in heavy metal tolerance (Turner ). In experimental evolution studies, Pool-Seq has also been used to identify genomic regions that show high differentiation between different selective treatments (Burke ; Parts ; Turner ). Finally, Pool-Seq offers enormous potential for selective genotyping (Darvasi and Soller, 1994; Hillel ; Lander and Botstein, 1989).
While several tools for analyzing Pool-Seq data of single populations are already available (Bansal, 2010; Kofler ; Pandey ), to our knowledge no standalone software tool is available for the comparison of Pool-Seq data from multiple populations. PoPoolation2 is a software tool dedicated to the comparison of allele frequencies between populations.
2 IMPLEMENTATION
As input, PoPoolation2 requires either a ‘pileup’ file for every population (sample) of interest or a single multi-sample ‘pileup’ file (mpileup). These files can be obtained by mapping the reads of a Pool-Seq experiment to a reference genome and subsequently converting the mapping results into the ‘pileup/mpileup’ format with samtools (Li ); see the manual (http://code.google.com/p/popoolation2/wiki/Manual) and the test data and tutorial (http://code.google.com/p/popoolation2/wiki/Tutorial). PoPoolation2 requires Pool-Seq data from at least two populations, but may be used with an unlimited number of populations.
To assess allele frequency differences between population samples, PoPoolation2 implements a wide variety of statistics.
When data from more than two populations are available, PoPoolation2 automatically computes all pairwise comparisons for these tests (except for the CMH test).
As the most intuitive measure of population differentiation, the allele frequency differences are reported.
The fixation index (FST) can be calculated to measure differentiation between populations. FST values may either be calculated with the classical approach (Hartl and Clark, 2007) or with an approach adapted to digital data (Karlsson ).
The statistical significance of allele frequency differences is determined with Fisher's exact test (Fisher, 1922).
Since biological replicates are often available in experimental evolution and selective genotyping studies, we implemented the Cochran–Mantel–Haenszel (CMH) test (Landis ) to test for statistical significance between groups.
All these analyses can be performed on different levels. We have implemented a sliding window analysis, which permits a genome-wide scan for differentiation using a specified window size. For the analysis of single SNPs, a window size of 1 may be used. Finally, with a user-provided GTF file, the analysis of genes, coding sequences, introns, etc. is possible. To visualize the population differentiation across the genome, PoPoolation2 converts the results into file formats that are compatible with the Integrative Genomics Viewer (Robinson ).
Finally, PoPoolation2 also implements the functionality to randomly subsample the data to achieve a uniform coverage. The subsampling is based on a user-defined quality threshold. For analyzing the data with standard software, such as Mega5 (Tamura ) and Arlequin (Excoffier and Lischer, 2010), PoPoolation2 allows exporting the data as artificial chromosomes in ‘multi-fasta’ files and as ‘GenePop’ files (Raymond and Rousset, 1995).
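The differentiation measures described in this section can be illustrated for a single biallelic SNP with a minimal Python sketch. This is not PoPoolation2's Perl/R code: all function names are hypothetical, the FST shown is the plain heterozygosity-based form without any pool-size or coverage correction, and the CMH statistic uses a standard continuity correction that is an assumption here. Inputs are allele counts as they might be read from a pileup column.

```python
from math import comb, erfc, sqrt

def allele_freq_diff(a1, n1, a2, n2):
    """Absolute difference in the frequency of one allele (a reads out of n)."""
    return abs(a1 / n1 - a2 / n2)

def fst(a1, n1, a2, n2):
    """Simplified FST = (H_total - H_within) / H_total for a biallelic SNP."""
    p1, p2 = a1 / n1, a2 / n2
    het = lambda p: 2 * p * (1 - p)          # expected heterozygosity
    h_within = (het(p1) + het(p2)) / 2       # mean within-population value
    h_total = het((p1 + p2) / 2)             # value for the combined sample
    return 0.0 if h_total == 0 else (h_total - h_within) / h_total

def fisher_exact(a, b, c, d):
    """Two-sided Fisher's exact test P-value for the table [[a, b], [c, d]]."""
    r1, r2, c1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(r1, x) * comb(r2, c1 - x) / comb(n, c1)

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Sum all tables that are at most as likely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def cmh_test(tables):
    """CMH chi-square (1 df) and P-value over 2x2 tables given as (a, b, c, d).

    Each table is one replicate: rows are the two groups, columns the two alleles.
    """
    diff = var = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        diff += a - (a + b) * (a + c) / n    # observed minus expected top-left cell
        var += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    yates = 0.5 if abs(diff) >= 0.5 else 0.0  # continuity correction (assumption)
    stat = (abs(diff) - yates) ** 2 / var
    return stat, erfc(sqrt(stat / 2))        # chi-square survival function, 1 df

# Example: 90/100 reads carry the allele in population 1, 10/100 in population 2.
diff = allele_freq_diff(90, 100, 10, 100)
fst_value = fst(90, 100, 10, 100)
p_fisher = fisher_exact(90, 10, 10, 90)
stat_cmh, p_cmh = cmh_test([(90, 10, 10, 90), (85, 15, 20, 80)])
```

For a window-based analysis, one simple option would be to average such per-SNP values over all SNPs in the window; whether this matches the tool's own windowing is not stated in the text.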
3 VALIDATION
To test PoPoolation2, we placed 10 000 SNPs for two populations on chromosome 2R of Drosophila melanogaster (v5.38). For these SNPs, we simulated 75 bp reads such that the coverage was 100× and the allele frequency differences between the two populations ranged from 0.1 to 0.9. Subsequently, the simulated reads were mapped to the reference genome (D. melanogaster, chromosome 2R, v5.38) with BWA (0.5.8) (Li and Durbin, 2009) and an ‘mpileup’ file was created using samtools (0.1.13) (Li ). Finally, we compared the expected values with the observed ones and found an almost perfect correlation between the simulated data and the estimates based on PoPoolation2 for all implemented tests (allele frequency differences: R² = 0.9979, P < 2.2e-16; FST: R² = 0.9967, P < 2.2e-16; Fisher's exact test: R² = 0.9974, P < 2.2e-16; CMH test: R² = 0.9978, P < 2.2e-16; Fig. 1). These high correlations confirm that PoPoolation2 yields highly reliable results (for details, see http://code.google.com/p/popoolation2/wiki/Validation).
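The comparison of expected and observed values can be summarized with a squared Pearson correlation, which for a simple linear regression equals the coefficient of determination. The following Python sketch shows that calculation; the function name and the example values are hypothetical, and the text does not state exactly how the reported R² values were computed.

```python
def r_squared(expected, observed):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(expected)
    mean_x = sum(expected) / n
    mean_y = sum(observed) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(expected, observed))
    sxx = sum((x - mean_x) ** 2 for x in expected)
    syy = sum((y - mean_y) ** 2 for y in observed)
    return sxy * sxy / (sxx * syy)

# Hypothetical example: simulated versus estimated allele frequency differences.
simulated = [0.1, 0.3, 0.5, 0.7, 0.9]
estimated = [0.12, 0.29, 0.52, 0.68, 0.91]
r2 = r_squared(simulated, estimated)
```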
Fig. 1.
Expected versus observed values for the tests implemented in PoPoolation2 using 10 000 simulated SNPs. (A) allele frequency difference; (B) FST; (C) Fisher's exact test [−log10(P-value)]; (D) CMH test [−log10(P-value)].
To ensure that all scripts continue to work properly, we implemented unit tests for the main scripts (which may be run by providing the parameter ‘--test’).
Authors: Leopold Parts; Francisco A Cubillos; Jonas Warringer; Kanika Jain; Francisco Salinas; Suzannah J Bumpstead; Mikael Molin; Amin Zia; Jared T Simpson; Michael A Quail; Alan Moses; Edward J Louis; Richard Durbin; Gianni Liti Journal: Genome Res Date: 2011-03-21 Impact factor: 9.043
Authors: David J Begun; Alisha K Holloway; Kristian Stevens; Ladeana W Hillier; Yu-Ping Poh; Matthew W Hahn; Phillip M Nista; Corbin D Jones; Andrew D Kern; Colin N Dewey; Lior Pachter; Eugene Myers; Charles H Langley Journal: PLoS Biol Date: 2007-11-06 Impact factor: 8.029
Authors: Thomas L Turner; Elizabeth C Bourne; Eric J Von Wettberg; Tina T Hu; Sergey V Nuzhdin Journal: Nat Genet Date: 2010-01-24 Impact factor: 38.330
Authors: Robert Kofler; Pablo Orozco-terWengel; Nicola De Maio; Ram Vinay Pandey; Viola Nolte; Andreas Futschik; Carolin Kosiol; Christian Schlötterer Journal: PLoS One Date: 2011-01-06 Impact factor: 3.752