Stéphane De Mita, Mathieu Siol.
Abstract
BACKGROUND: With the considerable growth of available nucleotide sequence data over the last decade, integrated and flexible analytical tools have become a necessity. In particular, in the field of population genetics, there is a strong need for automated and reliable procedures to conduct repeatable and rapid polymorphism analyses, coalescent simulations, data manipulation and estimation of demographic parameters under a variety of scenarios.
Year: 2012 PMID: 22494792 PMCID: PMC3350404 DOI: 10.1186/1471-2156-13-27
Source DB: PubMed Journal: BMC Genet ISSN: 1471-2156 Impact factor: 2.797
Figure 1. General architecture and components of the EggLib package. Solid lines denote dependency relationships (A → B denotes that A depends on, and uses, B). Dashed lines indicate optional dependencies.
Features available in EggLib and alternative population genetics software packages
| Feature | EggLib | Biopython | PyCogent | Bio++ | DnaSP | ms | CoaSim | DiyABC | ABCToolbox | msABC | ABCreg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference | This paper | [ | [ | [ | [ | [ | [ | [ | [ | [ | [ |
| Input format | FASTA + converters | Many formats | Many formats | Many formats | Several formats | Genepop format | Specific format | Tabular data | |||
| Alignment | Available (wrappers) | Available (wrappers) | Available (wrappers) | ||||||||
| Storage model | Full storage in memory | Full storage (alignments) and iterative parsing | Full storage in memory | Full storage in memory | Full storage in memory | ||||||
| BLAST wrapper | Available | Available | Available | ||||||||
| Gene prediction | Available | ||||||||||
| Microsatellites | Built-in | Genepop wrapper | |||||||||
| Sequences | Built-in | Built-in | Built-in | From simulations | |||||||
| Coding sequences | With Bio++ | Built-in | |||||||||
| Distance and maximum-likelihood methods through wrappers | Built-in distance and maximum likelihood methods + wrappers | Built-in distance and maximum likelihood methods | |||||||||
| Coalescence (standard model) | Built-in and ms wrapper | ms wrapper | Available | Available | Available | - | |||||
| Recombination | Available | Available | Available | Available | Available | ||||||
| Structured models | Available | Available | Available | Available | |||||||
| Diploid samples & selfing | Available | ||||||||||
| Infinite-site model | Available | Available | Available | Fixed number of sites | |||||||
| Homoplasy | Available | Available | Available | ||||||||
| Microsatellite models | Available | Available | Available | ||||||||
| Output | Sequences, FASTA, trees, statistics, Python objects | Arlequin-compatible file | P-values | Sequences, statistics | Sequences, Python objects | ||||||
| Models | Pre-defined models + all models allowed by the simulator (not restrictive) | Customizable divergence models with population size changes | Depends on the simulator used | All models allowed by ms | |||||||
| Summary statistics | Pre-defined statistics sets + all statistics available in EggLib (not restrictive) | Microsatellite and within- and between-population sequence statistics | Calculated by simulator or provided by the user | Within- and between-population sequence statistics | |||||||
| Analysis method | Rejection and local-linear regression | Rejection and local-linear regression | Rejection, local-linear regression, generalized linear models and others | Rejection and local-linear regression | |||||||
Figure 2. Effect of missing data and quality threshold on the detection of polymorphic sites. Estimates of the number of polymorphic sites as a function of the proportion of missing data for different quality thresholds (red = 100%, magenta = 90%, green = 50%, blue = 10%). The simulation parameters are as follows: number of segregating sites = 30; sample size = 40; only polymorphic sites are generated and analyzed; for each value of the proportion of missing data, nucleotides are replaced by N's by random sampling without replacement. Each point represents the average over 5000 repetitions.
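The simulation described in the Figure 2 caption can be sketched in plain Python. This is only an illustration and not the authors' script: the function name `simulate_detected_sites` is hypothetical, the allele frequency at each site is drawn uniformly (an assumption, since the caption does not say how sites were generated), and the quality threshold is assumed to be the minimum proportion of non-missing samples required for a site to be analyzed.

```python
import random

def simulate_detected_sites(n_sites=30, n_samples=40, missing=0.2,
                            threshold=0.5, rng=None):
    """Count the polymorphic sites still detected after adding missing data."""
    rng = rng or random.Random()
    # Build an alignment made only of polymorphic, biallelic sites.
    # Assumption: the allele count at each site is uniform on 1..n_samples-1.
    columns = []
    for _ in range(n_sites):
        k = rng.randint(1, n_samples - 1)
        col = ["T"] * k + ["A"] * (n_samples - k)
        rng.shuffle(col)
        columns.append(col)
    # Replace a fixed share of all nucleotides by 'N', sampled without replacement.
    cells = [(i, j) for i in range(n_sites) for j in range(n_samples)]
    for i, j in rng.sample(cells, round(missing * len(cells))):
        columns[i][j] = "N"
    detected = 0
    for col in columns:
        valid = [c for c in col if c != "N"]
        # Assumed rule: a site is analyzed only if enough samples are non-missing,
        # and it counts as polymorphic if two alleles remain among them.
        if len(valid) / n_samples >= threshold and len(set(valid)) > 1:
            detected += 1
    return detected
```

Averaging the returned value over many repetitions for a grid of `missing` and `threshold` values would give curves of the kind shown in the figure, although the exact numbers depend on the assumed allele frequency distribution.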
Figure 3. Example of diversity analysis implemented in Python using egglib-py. This script imports 100 FASTA-formatted alignments, performs a basic diversity analysis and finally compares the average Tajima's D statistic to a number of neutral coalescent simulations under the standard model. Lines 16, 19, and 44-46 are discussed in the text. All operations are performed using the Align class and the simul module of egglib-py (full documentation is included in the reference manual available online).
Figure 4. Code example: user-defined ABC model. Example of a user-defined demographic model extending EggLib's pre-implemented ABC models. A graphical representation of the model is shown at the top of the picture, and the code to implement it is shown at the bottom. Explanations can be found in the main text.
Running time and memory use while importing FASTA files
| File | EggLib time (s) | EggLib memory (MB) | Biopython time (s) | Biopython memory (MB) |
|---|---|---|---|---|
| Large alignment (96.5 MB) | 2.19 | 115.5 | 2.48 | 129.6 |
| Oryza sativa coding sequences | 2.39 | 100.4 | 5.12 | 313.8 |
| Oryza sativa pseudomolecules | 7.83 | 396.4 | 11.49 | 401.0 |
Note: The large alignment contains 10,000 sequences of 10,000 bp. The coding sequences of the Oryza sativa genome represent 67,393 sequences ranging from 153 to 16,311 bp while its pseudomolecules represent 12 sequences ranging from 23,011,239 to 43,268,879 bp.
Running time and memory use while performing diversity analyses
| File | EggLib time (s) | EggLib memory (MB) | libsequence time (s) | libsequence memory (MB) |
|---|---|---|---|---|
| 1000 files (49.8 MB) minimal | 4.17 | 9.3 | - | - |
| 1000 files (49.8 MB) standard | 9.54 | 9.5 | 12.34 | 1.8 |
| 1000 files (49.8 MB) LD | 26.43 | 151.7 | 47.87 | 124.8 |
| 1 file (33.0 MB) minimal | 4.35 | 104.0 | - | - |
| 1 file (33.0 MB) standard | 6.84 | 92.6 | 2.63 | 44.1 |
| 1 file (6.0 KB) coding | 0.16 | 8.7 | 0.06 | 0.1 |
Note: We analyzed 1000 simulated alignments of 50 sequences (plus one outgroup) of 1000 bp and a single alignment of 7 sequences of 4,920,321 bp. A subset of this alignment containing 6 sequences of 999 bp was analyzed for coding statistics. The minimal set of statistics was the number of polymorphic sites, θ estimators and Tajima's D. The standard set of statistics included minimal statistics plus haplotype-based statistics. Linkage disequilibrium (LD) was computed between polymorphic sites. For coding sequences, non-synonymous and synonymous θ estimators were calculated (for EggLib, the functions of Bio++ are called).
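To make the "minimal" statistics set concrete, the sketch below computes the number of polymorphic sites, two θ estimators (Watterson's θ and nucleotide diversity π) and Tajima's D from an alignment given as a list of equal-length strings. It is written in plain Python rather than with the EggLib API, the function name `diversity` is hypothetical, and it assumes an alignment without gaps or missing data.

```python
from collections import Counter
from math import sqrt

def diversity(seqs):
    """Return S, Watterson's theta, pi and Tajima's D (per alignment, not per site)."""
    n = len(seqs)
    n_pairs = n * (n - 1) / 2
    S, pi = 0, 0.0
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:
            S += 1
            # Share of sequence pairs that differ at this site
            same = sum(c * (c - 1) / 2 for c in counts.values())
            pi += (n_pairs - same) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    theta_w = S / a1
    # Constants of Tajima's (1989) variance estimator
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    var = e1 * S + e2 * S * (S - 1)
    # D is undefined without polymorphism or for very small samples
    D = (pi - theta_w) / sqrt(var) if var > 0 else None
    return {"S": S, "theta_w": theta_w, "pi": pi, "D": D}
```

For example, `diversity(["ACGT", "ACGA", "ACTA", "ATTA"])` finds three polymorphic sites out of four. Haplotype-based and linkage disequilibrium statistics from the "standard" and "LD" sets are not covered by this sketch.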
Running time and memory use while performing coalescent simulations
| Model | EggLib time (s) | EggLib memory (MB) | ms time (s) | ms memory (MB) | CoaSim time (s) | CoaSim memory (MB) |
|---|---|---|---|---|---|---|
| standard | 7.68 | 48 | 1.27 | 43 | 16.67 | 80 |
| recombination | 8.77 | 53 | 1.99 | 44 | 16.45 | 79 |
| structured | 7.65 | 48 | 1.50 | 42 | 20.75 | 79 |
Note: All three models (standard, recombination and structured) have 40 sequences with a fixed number of mutations of 100. 10,000 repetitions were run for each model. For the model with recombination, the scaled recombination parameter was set to 5 for all programs and the number of recombining segments was set to 1000 for eggcoal and ms (CoaSim does not require this parameter). For the structured model, 4 populations of 10 samples with a migration rate of 1 were simulated. The populations joined 10 coalescent time units in the past.
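As an illustration of the "standard" model with a fixed number of mutations, the following sketch simulates one genealogy and drops the mutations on its branches with probability proportional to branch length, each mutation creating a new site (infinite-site model). It is not the algorithm of eggcoal, ms or CoaSim: the function name `coalescent_fixed_s` is hypothetical, time is scaled so that each pair of lineages coalesces at rate 1 (an assumed convention), and recombination and population structure are left out.

```python
import random

def coalescent_fixed_s(n=40, n_mut=100, rng=None):
    """Simulate one standard coalescent tree carrying a fixed number of mutations.

    Returns (sites, tmrca): sites holds, for each mutation, the set of samples
    carrying the derived allele; tmrca is the height of the tree.
    """
    rng = rng or random.Random()
    lineages = [frozenset([i]) for i in range(n)]  # lineage = set of descendant samples
    start = {lin: 0.0 for lin in lineages}         # time at which each lineage appeared
    branches = []                                  # (length, descendants) of finished branches
    t = 0.0
    while len(lineages) > 1:
        k = len(lineages)
        # Waiting time to the next coalescence while k lineages remain
        t += rng.expovariate(k * (k - 1) / 2)
        i, j = rng.sample(range(k), 2)
        a, b = lineages[i], lineages[j]
        branches.append((t - start[a], a))
        branches.append((t - start[b], b))
        merged = a | b
        lineages = [l for idx, l in enumerate(lineages) if idx not in (i, j)]
        lineages.append(merged)
        start[merged] = t
    # Fixed number of mutations: pick branches with probability proportional to length
    weights = [length for length, _ in branches]
    hits = rng.choices(branches, weights=weights, k=n_mut)
    return [desc for _, desc in hits], t
```

Calling `coalescent_fixed_s(40, 100)` mirrors the sample size and mutation count given in the note. The size of each returned set is the derived allele count at that site, so a site frequency spectrum can be tallied directly from the result.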
Running time and memory use while performing ABC
| Simulation step: model + summary statistics | EggLib time (s) | EggLib memory (MB) | msABC time (s) | msABC memory (MB) |
|---|---|---|---|---|
| SNM + SDZ | 13.71 | 25.6 | 7.24 | 8.9 |
| SNMR + SDZ | 27.09 | 55.6 | 26.10 | 8.8 |
| PEMR + SDZ | 16.72 | 44.8 | 13.46 | 8.6 |
| BNM + SDZ | 15.68 | 37.6 | 8.27 | 9.1 |
| IM + SDZ | 40.06 | 70.3 | 21.52 | 14.2 |
| AM + SDZ | 25.11 | 57.8 | * | * |
| SNM + SFS | 15.83 | 25.8 | - | - |
| SNMR + SFS | 29.85 | 55.6 | - | - |
| PEMR + SFS | 18.05 | 44.9 | - | - |
| BNM + SFS | 18.22 | 36.5 | - | - |
| IM + SFS | 46.94 | 63.6 | - | - |
| AM + SFS | 29.15 | 51.6 | - | - |
| Analysis step (data file: 830 MB) | 70.82 | 131.0 | 30.74 | 628.7 |
Note: Models: standard neutral model (SNM), standard neutral model with recombination (SNMR), population expansion model with recombination (PEMR), bottleneck model (BNM), island model with two populations (IM), admixture model (AM). Uniform prior bounds: 0-0.05 (per site) for the mutation and recombination rates, 0.01-1 for the migration rate, 0-1 for date/duration parameters, 0-1 for the population size during the bottleneck, and 0-10 for the ancestral population size. Summary statistics sets: SDZ (number of polymorphic sites, Tajima's D and Fay and Wu's H), SFS (site frequency spectrum with 8 categories). The SFS was available only with EggLib. 20 loci of 40 sequences, each 1000 bp long, were analyzed and each ABC simulation run generated 1000 data samples. For the analysis phase, a large data set of 5,000,000 samples (containing two varying model parameters and nine statistics) was used. EggLib was used through the interactive commands abc_sample and abc_fit. (*) The AM model could not be implemented with msABC.
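The rejection step of ABC can be illustrated with a toy version of the SNM case. The sketch below is not the implementation behind abc_sample or abc_fit: the function names `simulate_s` and `abc_rejection` are hypothetical, a single locus and a single summary statistic (the number of polymorphic sites) are used, and the default prior of 0 to 50 on the per-locus θ is simply the 0-0.05 per-site bound multiplied by 1000 bp. The local-linear regression adjustment is not shown.

```python
import random

def simulate_s(theta, n, rng):
    """Number of polymorphic sites for n samples under the standard neutral model."""
    # Total tree length: while k lineages remain, the waiting time is Exp(k(k-1)/2)
    total_length = sum(k * rng.expovariate(k * (k - 1) / 2) for k in range(2, n + 1))
    # Mutations fall on the tree as a Poisson process with rate theta/2 per unit length;
    # the Poisson draw counts unit-rate exponential arrivals before the mean.
    mean = theta / 2 * total_length
    count, t = 0, rng.expovariate(1.0)
    while t < mean:
        count += 1
        t += rng.expovariate(1.0)
    return count

def abc_rejection(s_obs, n, prior=(0.0, 50.0), n_draws=20000, tol=1, rng=None):
    """Keep the prior draws of theta whose simulated S lies within tol of s_obs."""
    rng = rng or random.Random()
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(*prior)  # draw a parameter value from the uniform prior
        if abs(simulate_s(theta, n, rng) - s_obs) <= tol:
            accepted.append(theta)
    return accepted
```

The accepted values approximate the posterior distribution of θ. A smaller `tol` gives a closer approximation at the cost of fewer accepted draws, which is the trade-off that regression-based adjustments aim to ease.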