Ruoyun Hui, Eugenia D'Atanasio, Lara M Cassidy, Christiana L Scheib, Toomas Kivisild.
Abstract
Although ancient DNA data have become increasingly important in studies of past populations, it is often not feasible or practical to obtain high-coverage genomes from poorly preserved samples. While accurate genotype imputation from > 1 × coverage data has recently become routine, a large proportion of ancient samples remain unusable for downstream analyses because of their low coverage. Here, we evaluate a two-step pipeline for the imputation of common variants in ancient genomes at 0.05–1 × coverage. We use the genotype likelihood input mode in Beagle and filter for confident genotypes as the input to impute missing genotypes. This procedure, when tested on ancient genomes, outperforms a single-step imputation from genotype likelihoods, suggesting that current genotype callers do not fully account for errors in ancient sequences and that additional quality controls can be beneficial. We compared the effect of various genotype likelihood calling methods, post-calling, pre-imputation and post-imputation filters, different reference panels, as well as different imputation tools. In a Neolithic Hungarian genome, we obtain ~ 90% imputation accuracy for heterozygous common variants at coverage 0.05 × and > 97% accuracy at coverage 0.5 ×. We show that imputation can mitigate, though not eliminate, reference bias in ultra-low coverage ancient genomes.
Year: 2020 PMID: 33122697 PMCID: PMC7596702 DOI: 10.1038/s41598-020-75387-w
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Schematic representation of the imputation pipeline. The input and output of the starting down-sampling step are both alignment files in BAM format. The output of each step of the pipeline (genotype calling, genotype probability update and genotype imputation) is a VCF file. In the output boxes, data fields that are updated and necessary in the following step of the pipeline are highlighted in green (1KG: 1000 Genomes).
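The down-sampling step in the schematic amounts to keeping a fixed fraction of the aligned reads. As a minimal sketch, assuming reads are sampled uniformly and independently (an assumption; the caption does not name the tool or sampling scheme), the fraction to keep is the target coverage divided by the original coverage. The function name is illustrative, and the 20 × starting coverage is taken from the Figure 2 caption.

```python
def downsample_fraction(target_coverage: float, original_coverage: float) -> float:
    """Fraction of reads to keep so that mean coverage drops to the target.

    Assumes uniform, independent sampling of reads (an assumption, not a
    detail stated in the source).
    """
    if original_coverage <= 0:
        raise ValueError("original coverage must be positive")
    if not 0 < target_coverage <= original_coverage:
        raise ValueError("target coverage must be in (0, original coverage]")
    return target_coverage / original_coverage


# Coverages evaluated in the accuracy table, starting from a 20 x genome.
for target in (0.05, 0.1, 0.5, 0.75, 1.0, 1.5, 2.0):
    fraction = downsample_fraction(target, 20.0)
    print(f"{target:>5} x -> keep {fraction:.4%} of reads")
```

The resulting fraction is what a read-subsampling tool would take as its sampling probability.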
Imputation accuracy of the default pipeline across all coverages in NE1 chromosome 20.
| Minor allele frequency bin | 0.05 × | 0.1 × | 0.5 × | 0.75 × | 1 × | 1.5 × | 2 × |
|---|---|---|---|---|---|---|---|
| 0.001–0.01 | 0.490 | 0.614 | 0.720 | 0.731 | 0.743 | 0.753 | 0.765 |
| 0.01–0.05 | 0.574 | 0.712 | 0.835 | 0.844 | 0.849 | 0.858 | 0.869 |
| 0.05–0.1 | 0.837 | 0.881 | 0.944 | 0.947 | 0.949 | 0.951 | 0.959 |
| 0.1–0.3 | 0.871 | 0.922 | 0.972 | 0.972 | 0.977 | 0.975 | 0.979 |
| > 0.3 | 0.923 | 0.955 | 0.984 | 0.982 | 0.984 | 0.982 | 0.985 |
| Common variants (MAF ≥ 0.05) | 0.891 | 0.933 | 0.975 | 0.974 | 0.977 | 0.976 | 0.980 |
| 0.001–0.01 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 | 0.994 |
| 0.01–0.05 | 0.991 | 0.991 | 0.993 | 0.993 | 0.993 | 0.993 | 0.994 |
| 0.05–0.1 | 0.983 | 0.984 | 0.991 | 0.992 | 0.992 | 0.992 | 0.993 |
| 0.1–0.3 | 0.967 | 0.974 | 0.988 | 0.989 | 0.990 | 0.989 | 0.991 |
| > 0.3 | 0.961 | 0.966 | 0.986 | 0.987 | 0.988 | 0.988 | 0.989 |
| Common variants (MAF ≥ 0.05) | 0.970 | 0.974 | 0.988 | 0.989 | 0.990 | 0.989 | 0.991 |
Accuracies are also shown for various MAF bins and genotypes in the full-coverage genome. Accuracies for homozygous sites are presented in Table S1; the actual numbers of correctly imputed sites are presented in Table S2.
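The per-bin figures in the table can be reproduced in outline by grouping sites by minor allele frequency and comparing imputed genotypes with the full-coverage calls. The sketch below assumes that accuracy means the fraction of retained imputed genotypes that match the full-coverage genotype, and that each bin includes its lower edge; both are assumptions, as are the function names and the toy data.

```python
from bisect import bisect_right
from collections import defaultdict

# Bin edges follow the table rows. Boundary handling is an assumption:
# each bin includes its lower edge, and sites with MAF < 0.001 are skipped.
EDGES = [0.001, 0.01, 0.05, 0.1, 0.3]
LABELS = ["0.001-0.01", "0.01-0.05", "0.05-0.1", "0.1-0.3", "> 0.3"]


def maf_bin(maf):
    """Return the table-row label for a MAF value, or None if below 0.001."""
    i = bisect_right(EDGES, maf)
    return LABELS[i - 1] if i > 0 else None


def accuracy_by_bin(sites):
    """Compute accuracy per MAF bin.

    sites: iterable of (maf, true_genotype, imputed_genotype), where
    genotypes are alternate-allele counts (0, 1 or 2). An imputed genotype
    of None marks a site removed by a filter; such sites are not counted.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for maf, truth, imputed in sites:
        label = maf_bin(maf)
        if label is None or imputed is None:
            continue
        total[label] += 1
        correct[label] += int(truth == imputed)
    return {lab: correct[lab] / total[lab] for lab in LABELS if total[lab]}


# Toy data, invented for illustration only.
toy_sites = [(0.02, 1, 1), (0.02, 1, 0), (0.2, 1, 1), (0.4, 0, 0), (0.4, 1, None)]
print(accuracy_by_bin(toy_sites))
```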
Figure 2. Imputation accuracy of heterozygous sites following the default pipeline evaluated by down-sampling NE1 chr20. The main figure shows the accuracy across coverages (on a log scale, X-axis), with and without the post-calling deamination filter. The inset on the top-right corner shows the proportion of heterozygous sites called in the original 20 × genome that are correctly imputed (i.e. not imputed as homozygous or failing the post-imputation filter).
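The main panel and the inset of Figure 2 report two different quantities, and a small example may help keep them apart. The sketch below assumes that accuracy is computed over true heterozygous sites that survive the post-imputation filter, while the inset proportion uses all heterozygous sites in the full-coverage genome as its denominator. The exact accuracy definition is an assumption, and the function name and toy data are illustrative.

```python
def het_metrics(truth, imputed):
    """Return (accuracy, recovered_proportion) for heterozygous sites.

    truth:   dict mapping site -> alternate-allele count (0, 1, 2) in the
             full-coverage genome.
    imputed: dict mapping site -> alternate-allele count after imputation;
             sites that failed the post-imputation filter are absent or None.
    """
    true_hets = [site for site, gt in truth.items() if gt == 1]
    kept = [site for site in true_hets if imputed.get(site) is not None]
    correct = [site for site in kept if imputed[site] == 1]

    # Accuracy: among retained true heterozygous sites, fraction imputed as het.
    accuracy = len(correct) / len(kept) if kept else float("nan")
    # Inset quantity: among all true heterozygous sites, fraction that are
    # neither imputed as homozygous nor removed by the filter.
    recovered = len(correct) / len(true_hets) if true_hets else float("nan")
    return accuracy, recovered


# Toy data, invented for illustration: site "d" was filtered out.
toy_truth = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 0}
toy_imputed = {"a": 1, "b": 1, "c": 0, "e": 0}
print(het_metrics(toy_truth, toy_imputed))
```

A stricter filter can raise the first number while lowering the second, which is why the two are plotted together.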
Figure 3. Comparing performance between one-step and two-step imputation pipelines. Two-step pipelines have a pre-imputation filter applied: max(GP) ≥ 0.99 for Beagle 4.0 + Beagle 5 and Beagle 4.1 + Beagle 5; max(GP) ≥ 0.9 for GLIMPSE + Beagle 5. In the lower panel, post-imputation GP filters are max(GP) ≥ 0.9 for GLIMPSE and max(GP) ≥ 0.99 for all the others. We used a more relaxed cutoff for GPs generated by GLIMPSE because these appear more conservative than GPs generated by Beagle 4 and 5 (Table S3).
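The max(GP) filters described in this caption can be illustrated as follows. This is a minimal sketch, assuming a biallelic site whose genotype probabilities are ordered as P(0/0), P(0/1), P(1/1), as in the VCF GP field; the function name is hypothetical and the code is not the authors' implementation.

```python
def confident_genotype(gp, threshold=0.99):
    """Return the most probable genotype if it passes the max(GP) filter.

    gp: three genotype probabilities ordered as P(0/0), P(0/1), P(1/1).
    Returns the alternate-allele count (0, 1 or 2) when max(GP) >= threshold,
    otherwise None, meaning the genotype is set to missing.
    """
    if len(gp) != 3:
        raise ValueError("expected three genotype probabilities")
    best = max(range(3), key=lambda i: gp[i])
    return best if gp[best] >= threshold else None


# The same call can pass a relaxed cutoff (0.9) but fail a strict one (0.99).
example_gp = (0.05, 0.93, 0.02)
print(confident_genotype(example_gp, threshold=0.99))
print(confident_genotype(example_gp, threshold=0.9))
```

In a two-step pipeline, genotypes returned as None would be left missing and filled in by the second imputation step.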
Figure 4. Effect of different settings on imputation accuracy evaluated by down-sampling NE1. (A) Performance using different genotype callers in a 0.05 × coverage genome; (B) Effect of pre-imputation filters in a 0.05 × coverage genome; (C) Effect of post-imputation filters in a 0.05 × coverage genome; (D) Performance using different reference panels during the genotype probability update and imputation steps in a 0.05 × coverage genome. The inset on the top-right corner shows the proportion of heterozygous sites called in the original 20 × genome that are correctly imputed (i.e. not imputed as homozygous or failing the post-imputation filter). tv: transversion.
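For readers unfamiliar with the "tv" abbreviation, a transversion is a substitution between a purine (A, G) and a pyrimidine (C, T). Restricting to transversions is a common way to avoid the C→T and G→A transitions produced by deamination in ancient DNA. The sketch below shows one way such a site classifier could look; the function name is hypothetical, and how the filter is applied in the pipeline is not specified in this caption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transversion(ref, alt):
    """Return True if the ref -> alt substitution is a transversion.

    Only single-nucleotide substitutions between A, C, G and T are accepted.
    """
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError("expected two different bases from A, C, G, T")
    # A transversion changes the base class (purine <-> pyrimidine).
    return (ref in PURINES) != (alt in PURINES)


# C/T and G/A are transitions, the substitution types affected by deamination.
for ref, alt in [("C", "T"), ("G", "A"), ("A", "C"), ("G", "T")]:
    kind = "transversion" if is_transversion(ref, alt) else "transition"
    print(f"{ref}>{alt}: {kind}")
```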