| Literature DB >> 28108551 |
Takahiro Maruki1, Michael Lynch2.
Abstract
Genotype calling plays important roles in population-genomic studies, which have been greatly accelerated by sequencing technologies. To take full advantage of the resultant information, we have developed maximum-likelihood (ML) methods for calling genotypes from high-throughput sequencing data. As the statistical uncertainties associated with sequencing data depend on depths of coverage, we have developed two types of genotype callers. One approach is appropriate for low-coverage sequencing data, and incorporates population-level information on genotype frequencies and error rates pre-estimated by an ML method. Performance evaluation using computer simulations and human data shows that the proposed framework yields less biased estimates of allele frequencies and more accurate genotype calls than current widely used methods. Another type of genotype caller applies to high-coverage sequencing data, requires no prior genotype-frequency estimates, and makes no assumption on the number of alleles at a polymorphic site. Using computer simulations, we determine the depth of coverage necessary to accurately characterize polymorphisms using this second method. We applied the proposed method to high-coverage (mean 18×) sequencing data of 83 clones from a population of Daphnia pulex The results show that the proposed method enables conservative and reasonably powerful detection of polymorphisms with arbitrary numbers of alleles. We have extended the proposed method to the analysis of genomic data for polyploid organisms, showing that calling accurate polyploid genotypes requires much higher coverage than diploid genotypes.Entities:
Keywords: genotype call; polymorphism; population genomics
Mesh:
Year: 2017 PMID: 28108551 PMCID: PMC5427492 DOI: 10.1534/g3.117.039008
Source DB: PubMed Journal: G3 (Bethesda) ISSN: 2160-1836 Impact factor: 3.154
Probability of an observed nucleotide read as a function of the individual genotype g and error rate ε
| Genotype | Nucleotide Read | |||
|---|---|---|---|---|
| 1 ( | ||||
| 2 ( | ||||
| 3 ( | ||||
M and m denote candidate alleles (the two most abundant nucleotide reads in the population sample, e.g., C and T) and e1 and e2 denote other nucleotide reads (e.g., in this case, A and G).
Comparison of the allele-frequency estimates and called genotypes by different methods with low depths of coverage
| Method | Correct-Call Rate among Individuals (mean ± 2 SEM) | Correct-Call Rate among Called Genotypes (mean ± 2 SEM) | |||
|---|---|---|---|---|---|
| Proposed | 0.1 | 0 (HWE) | 0.10 ± 0.00050 | 0.84 ± 0.00064 | 0.91 ± 0.00049 |
| GATK | 0.1 | 0 (HWE) | 0.06 ± 0.00042 | 0.80 ± 0.00073 | 0.88 ± 0.00057 |
| Samtools | 0.1 | 0 (HWE) | 0.08 ± 0.00039 | 0.90 ± 0.00045 | 0.90 ± 0.00045 |
| ANGSD | 0.1 | 0 (HWE) | 0.10 ± 0.00050 | 0.90 ± 0.00045 | 0.90 ± 0.00045 |
| Proposed | 0.1 | Minimized | 0.10 ± 0.00050 | 0.87 ± 0.00068 | 0.95 ± 0.00049 |
| GATK | 0.1 | Minimized | 0.06 ± 0.00041 | 0.82 ± 0.00077 | 0.91 ± 0.00061 |
| Samtools | 0.1 | Minimized | 0.08 ± 0.00039 | 0.94 ± 0.00047 | 0.94 ± 0.00047 |
| ANGSD | 0.1 | Minimized | 0.11 ± 0.00050 | 0.94 ± 0.00046 | 0.94 ± 0.00046 |
| Proposed | 0.1 | Maximized | 0.10 ± 0.00063 | 0.95 ± 0.00046 | 1.00 ± 0.00015 |
| GATK | 0.1 | Maximized | 0.09 ± 0.00058 | 0.93 ± 0.00051 | 1.00 ± 0.00006 |
| Samtools | 0.1 | Maximized | 0.06 ± 0.00046 | 0.91 ± 0.00047 | 0.91 ± 0.00047 |
| ANGSD | 0.1 | Maximized | 0.08 ± 0.00057 | 0.91 ± 0.00047 | 0.91 ± 0.00047 |
| Proposed | 0.3 | 0 (HWE) | 0.30 ± 0.00077 | 0.77 ± 0.00081 | 0.88 ± 0.00080 |
| GATK | 0.3 | 0 (HWE) | 0.22 ± 0.00077 | 0.68 ± 0.00094 | 0.79 ± 0.00088 |
| Samtools | 0.3 | 0 (HWE) | 0.25 ± 0.00096 | 0.84 ± 0.00074 | 0.84 ± 0.00074 |
| ANGSD | 0.3 | 0 (HWE) | 0.30 ± 0.00076 | 0.84 ± 0.00074 | 0.84 ± 0.00074 |
| Proposed | 0.3 | Minimized | 0.30 ± 0.00063 | 0.74 ± 0.00087 | 0.87 ± 0.00075 |
| GATK | 0.3 | Minimized | 0.19 ± 0.00067 | 0.58 ± 0.00098 | 0.70 ± 0.00101 |
| Samtools | 0.3 | Minimized | 0.26 ± 0.00085 | 0.83 ± 0.00078 | 0.83 ± 0.00078 |
| ANGSD | 0.3 | Minimized | 0.31 ± 0.00064 | 0.83 ± 0.00078 | 0.83 ± 0.00078 |
| Proposed | 0.3 | Maximized | 0.30 ± 0.00094 | 0.94 ± 0.00047 | 1.00 ± 0.00017 |
| GATK | 0.3 | Maximized | 0.26 ± 0.00093 | 0.90 ± 0.00061 | 1.00 ± 0.00011 |
| Samtools | 0.3 | Maximized | 0.24 ± 0.00118 | 0.87 ± 0.00067 | 0.87 ± 0.00067 |
| ANGSD | 0.3 | Maximized | 0.28 ± 0.00102 | 0.87 ± 0.00067 | 0.87 ± 0.00067 |
q, and f are the minor-allele frequency, its estimate, and inbreeding coefficient, respectively. by the proposed method and ANGSD are directly estimated from sequence read data by the genotype-frequency estimator (Maruki and Lynch 2015) and Kim method, respectively. Called genotypes by the proposed method are by the Bayesian genotype caller. The correct-call rate among individuals is a fraction of individuals with correctly called genotypes among N = 100 individuals, where missing genotype calls are considered incorrect. On the other hand, the correct-call rate among called genotypes is calculated only among individuals with called genotypes. Mean depth of coverage µ = 3, error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set. HWE, Hardy–Weinberg equilibrium.
Comparison of the performance of the genotype-calling methods with human data
| Method | Correct-Call Rate among Individuals (mean ± 2 SEM) | Correct-Call Rate among Called Genotypes (mean ± 2 SEM) |
|---|---|---|
| BGC | 0.954 ± 0.0004 | 0.969 ± 0.0004 |
| GATK | 0.923 ± 0.0004 | 0.943 ± 0.0004 |
| Samtools | 0.966 ± 0.0004 | 0.966 ± 0.0004 |
| ANGSD | 0.967 ± 0.0004 | 0.967 ± 0.0004 |
| GATK + Beagle | 0.941 ± 0.0004 | 0.941 ± 0.0004 |
The correct-call rate among individuals is calculated among individuals with HapMap genotypes, where missing genotype calls are considered incorrect. On the other hand, the correct-call rate among called genotypes is calculated only among individuals with both HapMap genotypes and called genotypes, respectively.
Figure 1Correct-call rate of the high-coverage genotype caller as a function of the depth of coverage. (A) Correct-call rate when the true genotype of the individual is homozygous. (B) Correct-call rate when the true genotype of the individual is heterozygous. Error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set.
Performance of different genotype callers for calling diploid genotypes from nucleotide data at triallelic sites
| Method | Mean Coverage | Correct-Call Rate (mean ± 2 SEM) | Correct-Call Rate among Called Genotypes (mean ± 2 SEM) |
|---|---|---|---|
| HGC | 10 | 0.97 ± 0.00037 | 0.97 ± 0.00037 |
| GATK | 10 | 0.97 ± 0.00032 | 0.98 ± 0.00031 |
| Samtools | 10 | 0.96 ± 0.00040 | 0.96 ± 0.00040 |
| BGC | 10 | 0.78 ± 0.00099 | 0.79 ± 0.00099 |
| HGC | 15 | 0.99 ± 0.00022 | 0.99 ± 0.00022 |
| GATK | 15 | 1.00 ± 0.00013 | 1.00 ± 0.00013 |
| Samtools | 15 | 0.96 ± 0.00040 | 0.96 ± 0.00040 |
| BGC | 15 | 0.80 ± 0.00088 | 0.81 ± 0.00087 |
| HGC | 20 | 1.00 ± 0.00014 | 1.00 ± 0.00014 |
| GATK | 20 | 1.00 ± 0.00007 | 1.00 ± 0.00007 |
| Samtools | 20 | 0.97 ± 0.00031 | 0.97 ± 0.00031 |
| BGC | 20 | 0.80 ± 0.00084 | 0.81 ± 0.00083 |
| HGC | 30 | 1.00 ± 0.00005 | 1.00 ± 0.00005 |
| GATK | 30 | 1.00 ± 0.00002 | 1.00 ± 0.00002 |
| Samtools | 30 | 0.99 ± 0.00015 | 0.99 ± 0.00015 |
| BGC | 30 | 0.81 ± 0.00080 | 0.81 ± 0.00080 |
Allele frequencies p, q, and r are 0.7, 0.2, and 0.1, and the population is in Hardy–Weinberg equilibrium. Correct-call rate and that among called genotypes are calculated among N = 100 individuals and those with called genotypes, respectively. Error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set.
Figure 2Power analysis of the high-coverage genotype caller. (A) False-positive rate of polymorphism detection as a function of the mean depth of coverage. (B) False-negative rate of detecting polymorphisms at biallelic sites as a function of the minor-allele frequency. The solid curve shows the theoretical minimum value as the probability that just one of the alleles is sampled in a finite sample of size N given by where q is the minor-allele frequency. (C) False-positive rate of detecting triallelic sites as a function of the mean depth of coverage with q = 0.1 (i.e., the site is biallelic). (D) False-negative rate of inferring triallelic sites as a function of the frequency of the rarest allele. The frequency of the second most abundant allele is 0.2. The statistical significance of the likelihood-ratio tests is set at the 5% level in all panels. N = 100, error rate ε = 0.01, inbreeding coefficient f = 0 (Hardy–Weinberg equilibrium). Results are based on a total of 10,000 simulation replications for each parameter set.
Figure 3Correct-call rate of the triploid genotype caller as a function of the depth of coverage. (A) Correct-call rate when the true genotype of the individual is homozygous. (B) Correct-call rate when the true genotype of the individual is heterozygous containing two different nucleotides. (C) Correct-call rate when the true genotype of the individual is heterozygous containing three different nucleotides. Error rate ε = 0.01. Results are based on a total of 10,000 simulation replications for each parameter set.
Summary of the proposed methods
| Method | Ploidy | Advantage | Disadvantage |
|---|---|---|---|
| BGC | Two | More accurate for biallelic SNPs | Assumes at most two alleles |
| HGC | Two | Highly efficient | Genotype-frequency information not used |
| Arbitrary numbers of alleles | Less accurate for biallelic SNPs | ||
| TRI | Three | Highly efficient | Genotype-frequency information not used |
| Arbitrary numbers of alleles | |||
| TET | Four | Highly efficient | Genotype-frequency information not used |
| Arbitrary numbers of alleles |