Thorfinn Sand Korneliussen, Anders Albrechtsen, Rasmus Nielsen.
Abstract
BACKGROUND: High-throughput DNA sequencing technologies are generating vast amounts of data. Fast, flexible and memory-efficient implementations are needed in order to facilitate analyses of thousands of samples simultaneously.
Year: 2014 PMID: 25420514 PMCID: PMC4248462 DOI: 10.1186/s12859-014-0356-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. Data formats and call graph. A) Dependency of different data formats and analyses that can be performed in ANGSD. B) Simplified call graph. Red nodes indicate areas that are not threaded. With the exception of file readers, all analyses, printing and cleaning are done by objects derived from the abstract base class called general.
Overview of analyses implemented in ANGSD
| Analysis | Basis | Reference |
|---|---|---|
| | BC | [ |
| | GL | [ |
| | BC | [ |
| | BC/Seq | [ |
| | BC/GL/GP | [ |
| | GL | [ |
| | GL/SAF | [ |
| | GL/GP | [ |
| Population differentiation statistics | SAF | [ |
| Population structure via principal component analysis | GP | [ |
| | GL | [ |
| Detection of ancient admixture | BC | [ |
| Estimation of | SAF | [ |
| Estimation of | SAF | |
| | SAF | [ |
| Estimation of individual and site-wise | GL | [ |
| | GL | [ |
| | GL-GP | [ |
Table of the supported analyses in ANGSD. Some methods require a secondary program in the ANGSD package, some are methods for which ANGSD is the de facto implementation, and some are user-supplied extensions for ANGSD. The basis for each analysis is either the sequencing data (Seq), base counts (BC), genotype likelihoods (GL), sample allele frequencies (SAF) or genotype probabilities (GP).
Figure 2. 1D SFS for different GL models. SFS estimation based on a 170 megabase region from chromosome 1 using A) 12 CEU samples and B) 14 YRI samples from the 1000 Genomes Project. The analysis was performed for both the GATK GL model (green, light brown) and the SAMtools GL model (yellow, dark brown). Notice the difference in estimated variability (proportion of variable sites) between the two GL models: the GATK GL based analyses infer more variable sites and an associated larger proportion of low-frequency alleles. The two categories of invariable sites have been removed and the distributions have been normalized so that the frequencies of all categories sum to one for each method.
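The normalization described in this caption can be sketched as follows. This is a minimal illustration, assuming an unfolded SFS stored as a list of 2N + 1 expected counts whose first and last entries are the two invariable categories. The function name and the example counts are hypothetical and are not taken from ANGSD.

```python
def normalized_variable_sfs(sfs):
    """Drop the two invariable categories and rescale the rest to sum to one.

    Assumption: `sfs` is an unfolded spectrum with 2N + 1 entries, where
    entry 0 and entry 2N hold the invariable sites.
    """
    if len(sfs) < 3:
        raise ValueError("need at least one variable category")
    variable = sfs[1:-1]  # remove the first and last (invariable) bins
    total = sum(variable)
    if total <= 0:
        raise ValueError("no variable sites to normalize")
    return [count / total for count in variable]


# Hypothetical counts for 2 diploid samples (2N + 1 = 5 categories).
example_sfs = [9000.0, 60.0, 25.0, 15.0, 900.0]
proportions = normalized_variable_sfs(example_sfs)
print(proportions)
```

Because the invariable bins are removed before rescaling, spectra from different GL models are compared only on the shape of their variable part, not on the overall proportion of variable sites.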
Figure 3. Joint SFS (2D-SFS). Two-dimensional SFS estimation based on a 170 megabase region from chromosome 1 using 12 CEU samples and 14 YRI samples from the 1000 Genomes Project.
D-stat results for modern samples
| | H1 | H2 | H3 | nABBA | nBABA | Dstat | jackEst | SE | Z |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HGDP00521 (French) | HGDP00998 (American) | HGDP00927 (Yoruba) | 355539 | 360029 | -0.01 | -0.01 | 0.00 | -1.40 |
| 2 | HGDP00521 (French) | HGDP00778 (Han china) | HGDP00927 (Yoruba) | 361594 | 369006 | -0.01 | -0.01 | 0.00 | -2.40 |
| 3 | HGDP00998 (American) | HGDP00778 (Han china) | HGDP00927 (Yoruba) | 332227 | 334990 | -0.00 | -0.00 | 0.00 | -0.90 |
| 4 | HGDP00521 (French) | HGDP00542 (Papuan1) | HGDP00927 (Yoruba) | 360153 | 383994 | -0.03 | -0.03 | 0.00 | -6.80 |
| 5 | HGDP00998 (American) | HGDP00542 (Papuan1) | HGDP00927 (Yoruba) | 347593 | 366979 | -0.03 | -0.03 | 0.00 | -5.80 |
| 6 | HGDP00778 (Han china) | HGDP00542 (Papuan1) | HGDP00927 (Yoruba) | 347017 | 363467 | -0.02 | -0.02 | 0.00 | -5.20 |
| 7 | HGDP00927 (Yoruba) | HGDP00998 (American) | HGDP00521 (French) | 653515 | 360029 | 0.29 | 0.29 | 0.00 | 60.60 |
| 8 | HGDP00927 (Yoruba) | HGDP00778 (Han china) | HGDP00521 (French) | 639280 | 369006 | 0.27 | 0.27 | 0.01 | 53.00 |
| 9 | HGDP00998 (American) | HGDP00778 (Han china) | HGDP00521 (French) | 384915 | 407967 | -0.03 | -0.03 | 0.01 | -5.40 |
| 10 | HGDP00927 (Yoruba) | HGDP00542 (Papuan1) | HGDP00521 (French) | 626366 | 383994 | 0.24 | 0.24 | 0.01 | 43.10 |
| 11 | HGDP00998 (American) | HGDP00542 (Papuan1) | HGDP00521 (French) | 399343 | 450303 | -0.06 | -0.06 | 0.01 | -10.10 |
| 12 | HGDP00778 (Han china) | HGDP00542 (Papuan1) | HGDP00521 (French) | 405942 | 433790 | -0.03 | -0.03 | 0.01 | -5.50 |
| 13 | HGDP00927 (Yoruba) | HGDP00521 (French) | HGDP00998 (American) | 653515 | 355539 | 0.30 | 0.30 | 0.00 | 61.20 |
| 14 | HGDP00927 (Yoruba) | HGDP00778 (Han china) | HGDP00998 (American) | 711281 | 334990 | 0.36 | 0.36 | 0.01 | 71.80 |
| 15 | HGDP00521 (French) | HGDP00778 (Han china) | HGDP00998 (American) | 486385 | 407967 | 0.09 | 0.09 | 0.01 | 15.10 |
| 16 | HGDP00927 (Yoruba) | HGDP00542 (Papuan1) | HGDP00998 (American) | 660154 | 366979 | 0.29 | 0.29 | 0.01 | 53.80 |
| 17 | HGDP00521 (French) | HGDP00542 (Papuan1) | HGDP00998 (American) | 445929 | 450303 | -0.00 | -0.00 | 0.01 | -0.80 |
| 18 | HGDP00778 (Han china) | HGDP00542 (Papuan1) | HGDP00998 (American) | 394958 | 477720 | -0.09 | -0.09 | 0.01 | -15.30 |
| 19 | HGDP00927 (Yoruba) | HGDP00521 (French) | HGDP00778 (Han china) | 639280 | 361594 | 0.28 | 0.28 | 0.00 | 57.00 |
| 20 | HGDP00927 (Yoruba) | HGDP00998 (American) | HGDP00778 (Han china) | 711281 | 332227 | 0.36 | 0.36 | 0.01 | 72.70 |
| 21 | HGDP00521 (French) | HGDP00998 (American) | HGDP00778 (Han china) | 486385 | 384915 | 0.12 | 0.12 | 0.01 | 20.80 |
| 22 | HGDP00927 (Yoruba) | HGDP00542 (Papuan1) | HGDP00778 (Han china) | 666222 | 363467 | 0.29 | 0.29 | 0.01 | 55.10 |
| 23 | HGDP00521 (French) | HGDP00542 (Papuan1) | HGDP00778 (Han china) | 459135 | 433790 | 0.03 | 0.03 | 0.01 | 4.70 |
| 24 | HGDP00998 (American) | HGDP00542 (Papuan1) | HGDP00778 (Han china) | 401357 | 477720 | -0.09 | -0.09 | 0.01 | -14.20 |
| 25 | HGDP00927 (Yoruba) | HGDP00521 (French) | HGDP00542 (Papuan1) | 626366 | 360153 | 0.27 | 0.27 | 0.01 | 54.00 |
| 26 | HGDP00927 (Yoruba) | HGDP00998 (American) | HGDP00542 (Papuan1) | 660154 | 347593 | 0.31 | 0.31 | 0.01 | 60.60 |
| 27 | HGDP00521 (French) | HGDP00998 (American) | HGDP00542 (Papuan1) | 445929 | 399343 | 0.06 | 0.06 | 0.01 | 9.50 |
| 28 | HGDP00927 (Yoruba) | HGDP00778 (Han china) | HGDP00542 (Papuan1) | 666222 | 347017 | 0.32 | 0.32 | 0.01 | 61.90 |
| 29 | HGDP00521 (French) | HGDP00778 (Han china) | HGDP00542 (Papuan1) | 459135 | 405942 | 0.06 | 0.06 | 0.01 | 10.40 |
| 30 | HGDP00998 (American) | HGDP00778 (Han china) | HGDP00542 (Papuan1) | 401357 | 394958 | 0.01 | 0.01 | 0.01 | 1.30 |
Results of the ABBABABA analysis for modern individuals from the human genetic diversity panel.
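The D values in the table can be checked against the two site-count columns. The sketch below assumes that the fifth and sixth columns hold the ABBA and BABA site counts, in that order, and that D is their normalized difference. Under that assumption the reported values are reproduced to two decimals. The function name is hypothetical.

```python
def d_statistic(n_abba, n_baba):
    """Normalized difference between ABBA and BABA site counts.

    Assumption: D = (nABBA - nBABA) / (nABBA + nBABA), with the counts
    taken from the fifth and sixth columns of the table.
    """
    total = n_abba + n_baba
    if total == 0:
        raise ValueError("no informative sites")
    return (n_abba - n_baba) / total


# Row 1: French, American, Yoruba
print(round(d_statistic(355539, 360029), 2))
# Row 7: Yoruba, American, French
print(round(d_statistic(653515, 360029), 2))
```

The standard error and Z score in the table cannot be recomputed from the counts alone, since they depend on per-block counts that are not listed here.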
D-stat for ancient sample
| | H1 | H2 | H3 | nABBA | nBABA | Dstat | jackEst | SE | Z |
|---|---|---|---|---|---|---|---|---|---|
| 1 | HGDP00927 (Yoruba) | HGDP00542 (Papuan1) | T_hg19_1000g (Denisova) | 103016 | 90667 | 0.06 | 0.06 | 0.01 | 12.10 |
| 2 | T_hg19_1000g (Denisova) | HGDP00542 (Papuan1) | HGDP00927 (Yoruba) | 286551 | 90667 | 0.52 | 0.52 | 0.00 | 127.10 |
| 3 | T_hg19_1000g (Denisova) | HGDP00927 (Yoruba) | HGDP00542 (Papuan1) | 286551 | 103016 | 0.47 | 0.47 | 0.01 | 88.60 |
Results of the ABBABABA analysis for two modern individuals and one ancient sample.
Figure 4. Overlap between inferred SNPs with a critical p-value threshold of 10^-6 and not using BAQ. Venn diagram of the overlap between the SNP discovery for ANGSD, GATK and SAMtools for 33 CEU samples for chromosome 1. We used default parameters with GATK; for SAMtools we discarded reads with a mapping quality below 10. For ANGSD we chose a p-value threshold of 10^-6 and did not enable BAQ. In A, we used the SAMtools genotype likelihood model in ANGSD; in B, we used the GATK model in ANGSD.
Figure 5. Error rate vs call rate for called genotypes. Error rate and call rates for genotype calls based on different methods. The error rate is defined as the discordance rate between the HapMap genotype calls and the calls for the same individuals sequenced in the 1000 Genomes Project. Genotypes were called for all sites for all individuals for all methods. Each genotype call has a score, which was used to determine the call rate. Due to the discrete nature of some of the genotype scores we obtain a jagged curve.
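One way such a curve can be traced is sketched below. It assumes each genotype call carries a confidence score (higher meaning more confident) and a flag saying whether it disagrees with the reference genotype. Calls are ranked by score, and the cumulative discordance is recorded at each distinct score threshold. The function name, data layout, and example values are hypothetical.

```python
def error_vs_call_rate(calls):
    """Return (call_rate, error_rate) points, one per distinct score.

    Assumption: `calls` is a list of (score, is_discordant) pairs, where a
    higher score means a more confident genotype call.
    """
    ordered = sorted(calls, key=lambda call: call[0], reverse=True)
    n_calls = len(ordered)
    curve = []
    errors = 0
    for i, (score, is_discordant) in enumerate(ordered, start=1):
        errors += int(is_discordant)
        # Emit a point only after all calls tied at this score are included,
        # so discrete scores produce a few widely spaced points.
        if i == n_calls or ordered[i][0] != score:
            curve.append((i / n_calls, errors / i))
    return curve


# Hypothetical calls: two tied high-confidence calls, one error, one weak call.
example_calls = [(0.99, False), (0.99, False), (0.90, True), (0.60, False)]
for call_rate, error_rate in error_vs_call_rate(example_calls):
    print(call_rate, error_rate)
```

When many calls share the same score, the call rate jumps by a large step between consecutive points, which is consistent with the jagged appearance noted in the caption.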
Computational speed of GATK, SAMtools and ANGSD
| | | | | | | | |
|---|---|---|---|---|---|---|---|
| 50 Samples | 2722 | 1706 | 602 | 1744 | 1171 | 1765 | 1646 |
| 100 Samples | 5097 | 4049 | 1270 | 4143 | 2457 | 4373 | 4013 |
| 200 Samples | 10615 | 9672 | 2704 | 9951 | 5032 | 10330 | 7352 |
Wall-clock time (not CPU time) measured in seconds for different sample sizes and different numbers of allocated cores. The commands used are found in Additional file 7. We ran the analysis twice (in different order) and kept the lower value. Notice that the runtime for GATK and ANGSD does not decrease with 2 and 4 threads. This could be an indication that file reading is the bottleneck.