| Literature DB >> 31348818 |
Torsten Günther1, Carl Nettelblad2.
Abstract
Haploid high quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50 bp-reducing the amount of accepted mismatches, and increasing the probability of multiple matching sites in the genome. These ancient DNA specific properties are potentially exacerbating the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Most genomic regions we investigated show little to no mapping bias but even a small proportion of sites with bias can impact analyses of those particular loci or slightly skew genome-wide estimates. Therefore, reference bias has the potential to cause minor but significant differences in the results of downstream analyses such as population allele sharing, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.Entities:
Mesh:
Substances:
Year: 2019 PMID: 31348818 PMCID: PMC6685638 DOI: 10.1371/journal.pgen.1008302
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Information on the published medium to high coverage palaeogenomic and archaeogenomic data used in this study.
| Sample ID | (Partial) UDG treatment | SNP capture | Average sequencing depth± SE | Number of SNPs | Reference |
|---|---|---|---|---|---|
| Stuttgart | X | 16.0 ± 7.0 | 916,374 | [ | |
| Loschbour | X | 18.2 ± 7.6 | 950,730 | [ | |
| Ust’-Ishim | X | 31.8 ± 8.0 | 1,016,515 | [ | |
| sf12 | X | 66.3 ± 19.5 | 1,018,604 | [ | |
| baa001 | X | 13.7 ± 7.0 | 807,484 | [ | |
| Kotias | 13.3 ± 7.7 | 732,775 | [ | ||
| Bichon | 15.4 ± 10.2 | 731,151 | [ | ||
| ne1 | 19.5 ± 7.5 | 971,184 | [ | ||
| br2 | 19.3 ± 6.1 | 993,160 | [ | ||
| atp016 | 14.6 ± 6.4 | 888,522 | [ | ||
| Rathlin1 | 11.2 ± 6.1 | 655,195 | [ | ||
| Ballynahatty | 11.2 ± 6.1 | 655,947 | [ | ||
| I0054 | X | X | 2.9 ± 8.1 | 73,901 | [ |
| I0103 | X | X | 2.8 ± 7.2 | 79,825 | [ |
| I0118 | X | X | 2.0 ± 4.8 | 70,872 | [ |
| I0172 | X | X | 4.0 ± 9.4 | 94,558 | [ |
| I0408 | X | X | 2.0 ± 6.6 | 58,358 | [ |
| I0412 | X | X | 2.2 ± 6.1 | 64,057 | [ |
| I0585 | X | X | 2.9 ± 8.3 | 63,176 | [ |
| TAF011 | X | X | 1.1 ± 3.4 | 37,737 | [ |
| TAF013 | X | X | 0.9 ± 3.0 | 30,390 | [ |
| TAF014 | X | X | 0.8 ± 2.4 | 22,648 | [ |
| AltaiNeandertal | X | 47.7 ± 12.3 | 1,019,927 | [ | |
| VindijaNeandertal | X | 28.1 ± 9.8 | 1,009,725 | [ | |
| Denisovan | X | 28.5 ± 8.1 | 1,013,485 | [ |
‡ Samples for which unmapped reads were obtained
$ Enzymatic repair of deamination damages
† at 1,022,984 analyzed SGDP SNPs, using a minimum mapping quality of 30
Fig 1Reference bias in published genome-wide ancient DNA datasets for different minimum mapping quality thresholds.
The plot shows the average proportion of reads at heterozygous transversion sites (see Methods) representing the alternative allele. Error bars indicate two standard errors of the mean.
Proportion of alternative alleles when mapping back original reads and virtual opposite allele reads for the sf12 and Ust’-Ishim individuals.
| Proportion of alternative alleles | Individual | |||
|---|---|---|---|---|
| sf12 | Ust’-Ishim | |||
| # SNPs | Percentage | # SNPs | Percentage | |
| Filtered for uniquely mapping SNPs | ||||
| 0 | 19 | 0.00% | 24 | 0.00% |
| (0, 0.4) | 908 | 0.09% | 687 | 0.07% |
| [0.4, 0.5) | 329896 | 32.26% | 111894 | 10.94% |
| 0.5 | 644746 | 63.04% | 898825 | 87.90% |
| (0.5, 0.6] | 46964 | 4.59% | 11080 | 1.08% |
| (0.6, 1) | 113 | 0.01% | 88 | 0.01% |
| 1 | 11 | 0.00% | 7 | 0.00% |
| Unfiltered SNPs | ||||
| 0 | 96 | 0.01% | 1228 | 0.07% |
| (0, 0.4) | 16177 | 0.96% | 27425 | 1.62% |
| [0.4, 0.5) | 687703 | 40.64% | 282667 | 16.71% |
| 0.5 | 912384 | 53.92% | 1359770 | 80.41% |
| (0.5, 0.6] | 75448 | 4.46% | 19704 | 1.17% |
| (0.6, 1) | 325 | 0.02% | 286 | 0.02% |
| 1 | 36 | 0.00% | 34 | 0.00% |
Fig 2Connection between fragment length and reference bias.
(A) Proportion of alternative allele for different fragment length bins in the high coverage individual sf12. (B) Correlation between average proportion of alternative alleles and the mode of the fragment size distribution across all investigated individuals. (C) Proportion of heterozygous sites among all sites with sufficient coverage for different fragment length bins in the high coverage individual sf12. All error bars indicate two standard errors.
Fig 3D statistics testing the affinity between different modern populations (X) and two different treatments of the high coverage individual sf12.
The basis for these comparisons are the whole genome sequence data of the SGDP panel (A and B) or SNP array genotype data from the HO panel (C and D). Comparisons are done between pseudo-haploid and diploid calls for sf12 (A and C), and between pseudo-haploid calls from short (35-40 bp) or long (75-80 bp) fragments (B and D). The x axis represents the geographic origin of population X.
Fig 4D statistics similar to Fig 3 for different parts of the reference genome depending on their geographic origin [44].
The x axis represents the geographic origin of population X.
Percentage of Neandertal ancestry (and standard errors) in sf12 using diploid and pseudo-haploid calls and different subsets of the human reference genome.
Parts of the genome of East Asian origin were excluded due to their small total size.
| Statistic | Full reference | European Reference | African Reference |
|---|---|---|---|
|
| 3.17 ± 0.47 | 3.73 ± 0.82 | 1.76 ± 1.01 |
|
| 2.51 ± 0.46 | 2.97 ± 0.81 | 1.04 ± 0.99 |
|
| 3.00 ± 0.44 | 3.44 ± 0.77 | 1.38 ± 1.01 |
|
| 2.34 ± 0.44 | 2.71 ± 0.76 | 0.72 ± 1.01 |
|
| 2.98 ± 0.47 | 3.43 ± 0.83 | 1.79 ± 1.02 |
|
| 2.34 ± 0.46 | 2.71 ± 0.82 | 1.15 ± 0.99 |
|
| 2.89 ± 0.44 | 3.18 ± 0.78 | 1.46 ± 1.02 |
|
| 2.19 ± 0.44 | 2.46 ± 0.77 | 0.88 ± 1.01 |
$ d and h denote diploid and pseudo haploid-calls, respectively
Fig 5Comparison of different post-mapping filtering strategies for high coverage bam files from anatomically modern humans employing mapping and base quality filters of 30.
(A) Average proportion of the alternative allele for the comparison between no additional filters (see also Fig 1), remapping of reads carrying the reference allele modified to carry the alternative allele (modified reads), remapping against a modified reference carrying a third allele at the SNP sites, and both filters together. (B) Influence of filtering on measures of heterozygosity for different fragment sizes in sf12. Error bars indicate two standard errors.