Stinus Lindgreen, Anders Krogh, Jakob Skou Pedersen.
Abstract
BACKGROUND: As the use of next-generation sequencing technologies becomes more widespread, the need for robust software to help with the analysis is growing as well. A key challenge when analyzing sequencing data is the prediction of genotypes from the reads, i.e. correct inference of the underlying DNA sequences that gave rise to the sequenced fragments. For diploid organisms, the genotyper should be able to predict both alleles in the individual. Variations between the individual and the population can then be analyzed by looking for SNPs (single nucleotide polymorphisms) in order to investigate diseases or phenotypic features. To perform robust and high-confidence genotyping and SNP calling, methods are needed that take the technology-specific limitations into account and can model different sources of error. As an example, ancient DNA poses special challenges as the data is often shallow and subject to errors induced by post-mortem damage.
Year: 2014 PMID: 25294605 PMCID: PMC4203901 DOI: 10.1186/1756-0500-7-698
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Performance of SNPest, GeMS and FreeBayes on real data
| Depth | SNPest, Bowtie2 (All) | SNPest, Bowtie2 (QC) | SNPest, BWA-PSSM (All) | SNPest, BWA-PSSM (QC) | GeMS, Bowtie2 (All) | GeMS, Bowtie2 (QC) | GeMS, BWA-PSSM (All) | GeMS, BWA-PSSM (QC) | FreeBayes, Bowtie2 (All) | FreeBayes, Bowtie2 (QC) | FreeBayes, BWA-PSSM (All) | FreeBayes, BWA-PSSM (QC) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 3 | 0 | 1 | 0 | 5 | 0 | 36 | 6 |  |  |  |  |
| 10 | 0 | 0 | 0 | 0 | 5 | 0 | 32 | 2 |  |  |  |  |
| 20 | 0 | 0 | 0 | 0 | 5 | 0 | 32 | 2 |  |  |  |  |
| 30 | 0 | 0 | 0 | 0 | 5 | 0 | 32 | 2 |  |  |  |  |
| 40 | 0 | 0 | 0 | 0 | 5 | 0 | 32 | 2 |  |  |  |  |
| 50 | 0 | 0 | 0 | 0 | 5 | 0 | 32 | 2 |  |  |  |  |
| 60 | 0 | 0 | 0 | 0 | 5 | 0 | 32 | 2 | 170 | 0 | 125 | 0 |
SNPest and GeMS were run on varying fractions of the available data (from a maximum of 5 reads per site up to a maximum of 60, which in this case means all reads), whereas FreeBayes was run only on all available data. The REL606 strain of E. coli was sequenced on the MiSeq platform to an average depth of 27X. Residual adapters were removed using AdapterRemoval, and the cleaned reads were mapped using two different mappers, Bowtie2 and BWA-PSSM. No SNPs are expected in this mapping, as we are mapping a known sequence back to itself. SNPest used the reference genome as a prior (see Additional file 1 for more results). All: all SNP candidates. QC: number of SNPs after filtering on quality (SNPest and FreeBayes: genotype quality >30; GeMS: Dixon Q-test p-value <0.01).
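The quality filter behind the QC columns can be illustrated with a minimal sketch. It assumes the SNP candidates are available as VCF records whose FORMAT column carries the standard `GQ` (genotype quality) key; the record does not state the output format of each tool, and the function names `genotype_quality` and `filter_by_gq` are hypothetical. The Dixon Q-test used for GeMS is not covered.

```python
def genotype_quality(vcf_line):
    """Return the GQ value of the first sample in a VCF data line, or None.

    Assumes tab-separated VCF with FORMAT in column 9 and one sample in column 10.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return None  # no FORMAT/sample columns present
    keys = fields[8].split(":")
    values = fields[9].split(":")
    if "GQ" not in keys:
        return None
    idx = keys.index("GQ")
    if idx >= len(values) or values[idx] == ".":
        return None  # GQ declared but missing for this sample
    return float(values[idx])


def filter_by_gq(vcf_lines, min_gq=30.0):
    """Keep data lines whose genotype quality is strictly above min_gq."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        gq = genotype_quality(line)
        if gq is not None and gq > min_gq:
            kept.append(line)
    return kept


if __name__ == "__main__":
    # Toy records for illustration only; not taken from the study.
    toy = [
        "##fileformat=VCFv4.1",
        "REL606\t101\t.\tA\tG\t50\t.\t.\tGT:GQ\t1:45",
        "REL606\t202\t.\tC\tT\t12\t.\t.\tGT:GQ\t1:30",
    ]
    print(len(filter_by_gq(toy)))
```

The comparison is strict, so a candidate with a genotype quality of exactly 30 is discarded, matching the ">30" threshold stated in the legend.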
Simulated ancient DNA based on the REL606 genome
| Depth | Standard model, Bowtie2 (All) | Standard model, Bowtie2 (QC) | Standard model, BWA-PSSM (All) | Standard model, BWA-PSSM (QC) | Damage model, Bowtie2 (All) | Damage model, Bowtie2 (QC) | Damage model, BWA-PSSM (All) | Damage model, BWA-PSSM (QC) |
|---|---|---|---|---|---|---|---|---|
| 5 | 198 | 11 | 763 | 80 | 66 | 0 | 236 | 4 |
| 10 | 55 | 0 | 184 | 0 | 51 | 0 | 154 | 0 |
| 20 | 55 | 0 | 183 | 0 | 51 | 0 | 154 | 0 |
| 30 | 55 | 0 | 183 | 0 | 51 | 0 | 154 | 0 |
| 40 | 55 | 0 | 183 | 0 | 51 | 0 | 154 | 0 |
| 50 | 55 | 0 | 183 | 0 | 51 | 0 | 154 | 0 |
| 60 | 55 | 0 | 183 | 0 | 51 | 0 | 154 | 0 |
Reads of length 36 bp were simulated to an average depth of 27X, and DNA damage was simulated in the reads as described in the main text. SNPest was run in haploid mode, without using the reference genome, both with and without the damage model.
Simulated ancient DNA based on the REL606 genome
| Depth | GeMS, Bowtie2 (All) | GeMS, Bowtie2 (QC) | GeMS, BWA-PSSM (All) | GeMS, BWA-PSSM (QC) | FreeBayes, Bowtie2 (All) | FreeBayes, Bowtie2 (QC) | FreeBayes, BWA-PSSM (All) | FreeBayes, BWA-PSSM (QC) |
|---|---|---|---|---|---|---|---|---|
| 5 | 88 | 32 | 298 | 97 |  |  |  |  |
| 10 | 55 | 15 | 178 | 46 |  |  |  |  |
| 20 | 55 | 15 | 177 | 46 |  |  |  |  |
| 30 | 55 | 15 | 177 | 46 |  |  |  |  |
| 40 | 55 | 15 | 177 | 46 |  |  |  |  |
| 50 | 55 | 15 | 177 | 46 |  |  |  |  |
| 60 | 55 | 15 | 177 | 46 | 2808 | 0 | 2680 | 0 |
Reads of length 36 bp were simulated to an average depth of 27X, and DNA damage was simulated in the reads as described in the main text. GeMS and FreeBayes were both run in haploid mode, GeMS with varying maximum read depths and FreeBayes using all data.
Figure 1. Predicted SNPs on low depth diploid data from human chromosome 20. The Venn diagram shows the performance of the five genotypers used (SNPest without using the reference genome, FreeBayes, SAMtools with bcftools, GATK’s HaplotypeCaller and GeMS) and illustrates the overlap in predicted SNPs between every combination of methods.
Results on low depth, diploid data from human chromosome 20
| Program | #SNPs | SNP rate | SNPs in dbSNP | SNPest SNPs in common | Exclusive SNPs | #Indels | Indels in dbSNP | Homo:hetero |
|---|---|---|---|---|---|---|---|---|
| SNPest | 14,159 | 0.02% | 99.42% | 100.00% | 0.13% | 454 | 59.03% | 0.64 |
| FreeBayes | 3,175 | 0.01% | 98.90% | 5.32% | 2.58% | 330 | 60.91% | 1.12 |
| SAMtools | 65,120 | 0.11% | 99.01% | 99.46% | 1.76% | 6,918 | 60.18% | 0.66 |
| GATK | 54,441 | 0.09% | 99.44% | 97.84% | 1.17% | 7,773 | 60.77% | 1.09 |
| GeMS | 73,694 | 0.13% | 87.59% | 99.24% | 17.95% | N/A | N/A | 0.62 |
The results from SNPest (without using the reference genome), FreeBayes, SAMtools with bcftools, GATK’s HaplotypeCaller and GeMS are shown. For each method, we report the number of high quality SNPs, the SNP rate, the fraction overlap with dbSNP 139, the fraction of SNPest predictions in common, the fraction of exclusive SNPs only predicted by this method, number of insertions/deletions, fraction of insertions/deletions found in dbSNP 139, and homozygous:heterozygous ratio for SNPs.
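The per-method summary statistics in this table can be sketched with plain set operations. The following is a minimal illustration, not the authors' pipeline: the function `summarize_calls` and its parameters are hypothetical, sites are represented as `(chromosome, position)` tuples, and `n_sites` (the denominator of the SNP rate) is an assumption, since the record does not say which sites were counted. The "SNPest predictions in common" column is read here as the share of SNPest's SNPs that the method also reports, which is the interpretation consistent with the tabulated numbers.

```python
def summarize_calls(calls, dbsnp_sites, snpest_sites, other_call_sets, n_sites):
    """Summarize one method's SNP calls.

    calls: dict mapping (chrom, pos) -> "hom" or "het" for this method.
    dbsnp_sites: set of known variant sites.
    snpest_sites: set of sites called by SNPest.
    other_call_sets: list of site sets, one per competing method.
    n_sites: number of sites considered (assumed denominator of the SNP rate).
    """
    sites = set(calls)
    if not sites or not snpest_sites:
        raise ValueError("need at least one call in each call set")

    # Sites that no other method reports.
    union_others = set().union(*other_call_sets)
    n_hom = sum(1 for g in calls.values() if g == "hom")
    n_het = sum(1 for g in calls.values() if g == "het")

    return {
        "n_snps": len(sites),
        "snp_rate": len(sites) / n_sites,
        "frac_in_dbsnp": len(sites & dbsnp_sites) / len(sites),
        "frac_snpest_in_common": len(sites & snpest_sites) / len(snpest_sites),
        "frac_exclusive": len(sites - union_others) / len(sites),
        "hom_het_ratio": n_hom / n_het if n_het else float("inf"),
    }


if __name__ == "__main__":
    # Toy inputs for illustration only; not taken from the study.
    calls = {("20", 100): "hom", ("20", 200): "het", ("20", 300): "het"}
    dbsnp = {("20", 100), ("20", 200)}
    snpest = {("20", 100), ("20", 200), ("20", 900)}
    print(summarize_calls(calls, dbsnp, snpest, [snpest], n_sites=1000))
```

Passing the competing call sets as a list keeps the exclusivity computation independent of how many methods are compared.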
Figure 2. Predicted SNPs on high depth diploid data from human chromosome 22. The Venn diagram shows the performance of the five genotypers used (SNPest without using the reference genome, FreeBayes, SAMtools with bcftools, GATK’s HaplotypeCaller and GeMS) and illustrates the overlap in predicted SNPs between every combination of methods.
Results on high depth, diploid data from human chromosome 22
| Program | #SNPs | SNP rate | SNPs in dbSNP | SNPest SNPs in common | Exclusive SNPs | #Indels | Indels in dbSNP | Homo:hetero |
|---|---|---|---|---|---|---|---|---|
| SNPest | 40,997 | 0.12% | 99.17% | 100.00% | 0.86% | 82 | 57.32% | 0.46 |
| FreeBayes | 11,570 | 0.03% | 87.99% | 14.52% | 29.55% | 511 | 60.86% | 4.67 |
| SAMtools | 43,679 | 0.13% | 99.37% | 96.92% | 0.40% | 3,880 | 57.45% | 0.49 |
| GATK | 43,721 | 0.13% | 99.29% | 95.40% | 2.77% | 5,660 | 56.29% | 0.58 |
| GeMS | 51,117 | 0.15% | 89.22% | 97.78% | 13.68% | N/A | N/A | 0.50 |
The results from SNPest (without using the reference genome), FreeBayes, SAMtools with bcftools, GATK’s HaplotypeCaller and GeMS are shown. For each method, we report the number of high quality SNPs, the SNP rate, the fraction overlap with dbSNP 139, the fraction of SNPest predictions in common, the fraction of exclusive SNPs only predicted by this method, number of insertions/deletions, fraction of insertions/deletions found in dbSNP 139, and homozygous:heterozygous ratio for SNPs.
Figure 3. The graphical model used in SNPest. Circles represent random variables (RVs). The two top RVs are global for a given position, whereas the boxed part of the model denotes the n individual reads covering the specific position.