Bastian Pfeifer, Ulrich Wittelsbürger, Sebastian E Ramos-Onsins, Martin J Lercher.
Abstract
Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson's MS and Ewing's MSMS programs to assess statistical significance based on coalescent simulations. PopGenome's integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.
Keywords: population genomics; single-nucleotide polymorphisms; software
Year: 2014 PMID: 24739305 PMCID: PMC4069620 DOI: 10.1093/molbev/msu136
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
Population Genetics Statistics Implemented in PopGenome’s Modules.
| Module | Statistics |
|---|---|
| Neutrality statistics | Tajima’s |
| Linkage disequilibrium | ZnS ( |
| Recombination statistics | Four-gamete test ( |
| Diversities | Nucleotide and haplotype diversity ( |
| Selective sweeps | CL, CLR ( |
| FST estimates | |
| MKT | McDonald–Kreitman test ( |
| Mixed statistics | Site frequency spectrum; fixed and shared polymorphisms; biallelic structure |
| BayeScanR | Bayesian estimation of |
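
As an illustration of one entry in the table above, the four-gamete test can be sketched in a few lines of Python. This is a minimal sketch and not PopGenome's implementation: the function name `four_gamete_violations` and the input encoding (one string of `0`/`1` characters per sampled chromosome) are assumptions made for this example. Under the infinite-sites model, a pair of biallelic sites showing all four allele combinations implies at least one recombination event between them.

```python
from itertools import combinations


def four_gamete_violations(haplotypes):
    """Return all site pairs (i, j), i < j, at which all four gametes occur.

    `haplotypes` is a list of equal-length strings over {'0', '1'},
    one string per sampled chromosome (a hypothetical input encoding).
    """
    if not haplotypes:
        return []
    n_sites = len(haplotypes[0])
    violations = []
    for i, j in combinations(range(n_sites), 2):
        # Collect the distinct two-site haplotypes (gametes) seen in the sample.
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            violations.append((i, j))
    return violations
```

The pairwise loop is quadratic in the number of sites, so for genome-scale data it would typically be applied per region or per window rather than to a whole chromosome at once.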
Times Required to Read Large Data Sets.
| Data Set | Individuals | SNPs | Format | Time for Reading |
|---|---|---|---|---|
| Arabidopsis (Chr 1) | 80 | 1,200,000 | SNP (1001 Genomes) | <1 min |
| | | | | ∼3 min |
| Human (Chr2: 100–150 Mb) | 1,094 | 660,000 | VCF (1000 Genomes) | ∼5 min |
| 3450 individual alignments | 25 | 200,000 | FASTA | ∼15 s |
a Intel® Core™ i3-2130 CPU @ 3.40 GHz × 4, 8 GB RAM, with data stored in temporary files.
b Without temporary files, if sufficient RAM is available.
Calculation Speed for Haplotype and Nucleotide Diversity in Sliding Windows.
| Data | Sliding Window (nucleotides) | Running Time |
|---|---|---|
| Arabidopsis (Chr 1); 80 individuals; 1,200,000 SNPs | Window size = 10,000; jump size = 10,000; number of windows = 3,042 | ∼30 s |
| Human (Chr 2: 100–150 Mb); 1,094 individuals; 660,000 SNPs | Window size = 1,000; jump size = 1,000; number of windows = 50,000 | ∼5 min |
| 3,450 alignments; 25 individuals; 5,086,953 sites; 200,097 SNPs | 3,450 windows (alignments) | ∼7 s |
a Intel® Core™ i3-2130 CPU @ 3.40 GHz × 4, 8 GB RAM.
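
The sliding-window setup in the table above (a window size and a jump size, both in nucleotides) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not PopGenome code: the function names, the input layout (SNP positions plus alternate-allele counts for `n` sampled chromosomes), and the choice to report diversity per nucleotide are all hypothetical. Per-site nucleotide diversity for a biallelic SNP is taken as 2k(n − k) / (n(n − 1)), the fraction of chromosome pairs that differ at that site.

```python
def site_pi(k, n):
    """Per-site nucleotide diversity for a biallelic SNP.

    k: number of chromosomes carrying the alternate allele
    n: total number of sampled chromosomes (n >= 2)
    """
    return 2.0 * k * (n - k) / (n * (n - 1))


def windowed_pi(positions, alt_counts, n, start, end, window, jump):
    """Nucleotide diversity per nucleotide in windows [w, w + window).

    Windows begin at `start` and advance by `jump`; only windows that fit
    entirely before `end` are reported. Returns (window_start, pi) tuples.
    """
    results = []
    w = start
    while w + window <= end:
        total = sum(
            site_pi(k, n)
            for pos, k in zip(positions, alt_counts)
            if w <= pos < w + window
        )
        results.append((w, total / window))
        w += jump
    return results
```

Setting `jump` equal to `window`, as in the table, gives consecutive non-overlapping windows; a smaller `jump` gives overlapping ones. The linear scan over all SNPs for every window keeps the sketch short; a real implementation would index the sorted positions instead.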
Diversity statistics for Arabidopsis thaliana chromosome 1. Data from the 1001 genomes project website (1001genomes.org) were analyzed in consecutive 10-kb windows. (A) Nucleotide diversity, (B) haplotype diversity, (C) fixation index (Hudson’s FST), contrasting one population against all other individuals. Each line corresponds to one population (see legend in panel [A]). Lines were smoothed using spline interpolation. The black bars around 15 Mb mask the centromere.
Tajima’s D calculated across nonsynonymous coding sites of exons in the human MHC region on chromosome 6. Each data point in (A) and (B) represents one exon; HLA type I and type II exons are shown in red. (A) Tajima’s D of a Tuscan population (117 individuals), plotted along chr. 6. (B) Comparison of Tajima’s D between a Tuscan (117 individuals) and a Yoruba (229 individuals) population. (C) Distribution (density curves) of the Tajima’s D values in (A) for MHC (red) and non-MHC exons (black). The blue curve displays the distribution of neutral values from coalescent simulations with Hudson’s MS based on all SNPs in the MHC region. Data from 1000genomes.org.
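
For readers unfamiliar with the statistic plotted in this figure, Tajima's D can be sketched in Python from its standard definition (Tajima 1989). This is a minimal sketch and not the code PopGenome runs: the function names and the input (alternate-allele counts at biallelic segregating sites among `n` sampled chromosomes) are assumptions made for this example. D compares nucleotide diversity π with the number of segregating sites S scaled by a1 = Σ 1/i for i = 1..n−1, normalized by an estimate of the standard deviation of the difference.

```python
import math


def nucleotide_diversity(alt_counts, n):
    """Sum of per-site pairwise differences over biallelic segregating sites."""
    return sum(2.0 * k * (n - k) / (n * (n - 1)) for k in alt_counts)


def tajimas_d(pi, num_segregating, n):
    """Tajima's D from pi, the number of segregating sites S, and sample size n.

    Returns None when D is undefined (no segregating sites, or n < 4,
    where the variance term vanishes).
    """
    s = num_segregating
    if s == 0 or n < 4:
        return None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))
    b1 = (n + 1.0) / (3.0 * (n - 1.0))
    b2 = 2.0 * (n * n + n + 3.0) / (9.0 * n * (n - 1.0))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2.0) / (a1 * n) + a2 / (a1 * a1)
    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)
    variance = e1 * s + e2 * s * (s - 1.0)
    return (pi - s / a1) / math.sqrt(variance)
```

In a per-exon analysis like the one in the figure, `alt_counts` would hold only the SNPs falling in one exon, giving one D value per exon. An excess of rare variants yields negative values and an excess of intermediate-frequency variants yields positive values.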
Comparison of PopGenome with existing software for population genetics and population genomics analyses. Symbols reflect the breadth of the implemented functionalities: ++, broad; +, limited; −, nonexistent. Details on the criteria used for assignment to the breadth classes are given in supplementary table S1, Supplementary Material online.