Roger Ros-Freixedes, Mara Battagin, Martin Johnsson, Gregor Gorjanc, Alan J Mileham, Steve D Rounsley, John M Hickey.
Abstract
BACKGROUND: Inherent sources of error and bias that affect the quality of sequence data include index hopping and bias towards the reference allele. The impact of these artefacts is likely greater for low-coverage data than for high-coverage data, because low-coverage data carry scant information and many standard tools for processing sequence data were designed for high-coverage data. With the proliferation of cost-effective low-coverage sequencing, there is a need to understand the impact of these errors and biases on the genotype calls that result from low-coverage sequencing.
Year: 2018 PMID: 30545283 PMCID: PMC6293637 DOI: 10.1186/s12711-018-0436-4
Source DB: PubMed Journal: Genet Sel Evol ISSN: 0999-193X Impact factor: 4.297
Number of biallelic SNPs discovered on chromosome 1 with low and high sequencing coverage and percentage of overlap with the SNP genotyping array
| | Low coverage | High coverage |
|---|---|---|
| Number of variants | 1,333,943 | 1,693,308 |
| Overlap with high-coverage data | 96.9% | – |
| Overlap with low-coverage data | – | 76.3% |
| Overlap with the SNP genotyping array^a | 88.9% | 95.7% |
^a Relative to the 5779 variants present in the SNP genotyping array GGP-Porcine HD BeadChip (GeneSeek, Lincoln, NE) that segregated in the 26 individuals tested
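As a quick sanity check on the table above, the two overlap percentages should imply roughly the same number of shared variants, since the shared set is the same whichever coverage level it is expressed against. The sketch below uses only the counts and percentages from the table; the helper `overlap_percent` is a hypothetical name introduced for illustration and is not taken from the paper.

```python
# Counts and percentages copied from the table above.
n_low = 1_333_943        # biallelic SNPs discovered at low coverage
n_high = 1_693_308       # biallelic SNPs discovered at high coverage
pct_low_in_high = 96.9   # % of low-coverage variants also found at high coverage
pct_high_in_low = 76.3   # % of high-coverage variants also found at low coverage

# Both percentages describe the same shared set, so the implied counts
# should agree up to the rounding of the published percentages.
shared_from_low = n_low * pct_low_in_high / 100
shared_from_high = n_high * pct_high_in_low / 100


def overlap_percent(a, b):
    """Percentage of the variants in set `a` that are also present in set `b`."""
    return 100 * len(a & b) / len(a)


print(round(shared_from_low), round(shared_from_high))
```

With real data, `overlap_percent` would be applied to sets of variant identifiers (for example chromosome and position pairs) from each call set.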
Number of biallelic SNPs discovered on chromosome 1 with low sequencing coverage with different GATK HaplotypeCaller pruning options, the percentage of variants not validated with high sequencing coverage, and genotype and allele concordances with the SNP genotyping array
| | minPruning = 2 (default) | minPruning = 1 |
|---|---|---|
| Number of variants | 1,333,943 | 1,877,644 |
| Not validated at high coverage | 3.1% | 24.1% |
| Best-guess genotype concordance | 62.1% | 76.5% |
| Allele concordance | 77.6% | 87.5% |
Concordance of best-guess genotype calls from sequence data with SNP array genotypes, using allele read counts obtained with the default settings of GATK HaplotypeCaller
| Coverage | n^a | Genotype concordance (%) | Allele concordance (%) | 0\|0 (true = 0, %) | 1\|0 (true = 0, %) | 2\|0 (true = 0, %) | 0\|1 (true = 1, %) | 1\|1 (true = 1, %) | 2\|1 (true = 1, %) | 0\|2 (true = 2, %) | 1\|2 (true = 2, %) | 2\|2 (true = 2, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Low coverage | ||||||||||||
| 1× | 27,185 | 42.2 | 61.0 | 99.97 | – | 0.03 | 95.14 | – | 4.86 | 81.96 | – | 18.04 |
| 2× | 33,638 | 57.2 | 76.0 | 99.94 | 0.00 | 0.06 | 72.87 | 3.51 | 23.62 | 20.07 | 0.25 | 79.68 |
| 3× | 24,789 | 70.3 | 84.5 | 99.91 | 0.08 | 0.01 | 56.37 | 31.87 | 11.76 | 6.23 | 1.45 | 92.32 |
| 4× | 14,015 | 79.7 | 89.6 | 99.85 | 0.13 | 0.02 | 43.11 | 51.44 | 5.46 | 2.14 | 1.69 | 96.16 |
| 5× | 6502 | 85.6 | 92.7 | 99.93 | 0.04 | 0.04 | 32.65 | 64.75 | 2.59 | 0.90 | 1.96 | 97.14 |
| 6–10× | 3705 | 90.5 | 95.2 | 99.83 | 0.12 | 0.06 | 22.47 | 74.68 | 2.85 | 0.61 | 1.07 | 98.32 |
| Overall | 109,834 | 62.1 | 77.6 | 99.92 | 0.04 | 0.03 | 66.41 | 21.50 | 12.09 | 29.84 | 0.71 | 69.45 |
| High coverage | 131,806 | 99.7 | 99.9 | 99.80 | 0.19 | 0.01 | 0.21 | 99.72 | 0.07 | 0.17 | 0.16 | 99.68 |
^a Number of genotypes called across 26 individuals at 5136 and 5531 SNPs for low- and high-coverage data, respectively
Concordance is shown by coverage at variant site
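The three kinds of concordance reported in the table can be illustrated with a small sketch. It assumes that genotypes are coded as 0, 1 or 2 copies of the alternative allele and that allele concordance counts the proportion of matching alleles, so a call that is off by one allele scores one half. These definitions are assumptions made for illustration; the excerpt does not state the authors' exact formulas. The function name `concordance_summary` and the toy data are hypothetical.

```python
from collections import Counter


def concordance_summary(called, true):
    """Return genotype concordance (%), allele concordance (%) and the
    percentage of each called genotype within each true genotype.

    Genotypes are assumed to be coded 0/1/2; a called value of None
    means that no genotype was called and the pair is skipped.
    """
    pairs = [(c, t) for c, t in zip(called, true) if c is not None]
    n = len(pairs)
    # Genotype concordance: share of calls identical to the true genotype.
    genotype = 100 * sum(c == t for c, t in pairs) / n
    # Allele concordance (assumed definition): share of the 2n alleles that match.
    allele = 100 * sum(2 - abs(c - t) for c, t in pairs) / (2 * n)
    # Concordance by genotype: keys are (called, true), as in the x|y columns.
    counts = Counter(pairs)
    true_totals = Counter(t for _, t in pairs)
    by_genotype = {
        (c, t): 100 * counts[(c, t)] / true_totals[t] for (c, t) in counts
    }
    return genotype, allele, by_genotype


# Toy example, not data from the paper.
called = [0, 1, 2, 0, 2, None]
true = [0, 1, 2, 1, 0, 2]
genotype, allele, by_genotype = concordance_summary(called, true)
```

Under this definition, allele concordance equals the midpoint between genotype concordance and 100% whenever every error is off by a single allele, which is consistent with rows such as 81.4% and 90.7% elsewhere in these tables.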
Concordance of best-guess genotype calls from sequence data with SNP array genotypes, using allele read counts obtained from aligned reads in BAM files
| Coverage | n^a | Genotype concordance (%) | Allele concordance (%) | 0\|0 (true = 0, %) | 1\|0 (true = 0, %) | 2\|0 (true = 0, %) | 0\|1 (true = 1, %) | 1\|1 (true = 1, %) | 2\|1 (true = 1, %) | 0\|2 (true = 2, %) | 1\|2 (true = 2, %) | 2\|2 (true = 2, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Low coverage | ||||||||||||
| 1× | 28,300 | 62.1 | 80.8 | 99.34 | – | 0.66 | 51.46 | – | 48.54 | 0.96 | – | 99.04 |
| 2× | 32,699 | 79.5 | 89.7 | 98.42 | 1.53 | 0.05 | 26.41 | 48.15 | 25.44 | 0.21 | 1.70 | 98.09 |
| 3× | 25,993 | 88.3 | 94.1 | 98.25 | 1.72 | 0.03 | 14.01 | 71.98 | 14.01 | 0.12 | 2.36 | 97.52 |
| 4× | 16,346 | 92.5 | 96.3 | 97.91 | 2.09 | 0.00 | 8.36 | 83.84 | 7.80 | 0.00 | 2.77 | 97.23 |
| 5× | 8878 | 94.9 | 97.5 | 97.28 | 2.72 | 0.00 | 4.83 | 91.15 | 4.02 | 0.16 | 2.81 | 97.03 |
| 6–10× | 6444 | 95.0 | 97.5 | 97.43 | 2.50 | 0.07 | 5.01 | 91.09 | 3.90 | 0.00 | 2.75 | 97.25 |
| Overall | 118,660 | 81.1 | 90.5 | 98.39 | 1.43 | 0.18 | 24.43 | 52.30 | 23.27 | 0.33 | 1.71 | 97.96 |
| High coverage | 131,782 | 99.8 | 99.9 | 99.80 | 0.19 | 0.01 | 0.12 | 99.81 | 0.07 | 0.10 | 0.17 | 99.73 |
^a Number of genotypes called for 5531 SNPs across 26 individuals both for low- and high-coverage data
Concordance is shown by coverage at variant site
Concordance between genotype calls with different levels of conservativeness from low- and high-coverage sequence data, using allele read counts obtained from aligned reads in BAM files
| Coverage | n^a | Genotype concordance (%) | Allele concordance (%) | 0\|0^b (true = 0, %) | 1\|0 (true = 0, %) | 2\|0 (true = 0, %) | 0\|1 (true = 1, %) | 1\|1 (true = 1, %) | 2\|1 (true = 1, %) | 0\|2 (true = 2, %) | 1\|2 (true = 2, %) | 2\|2 (true = 2, %) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best-guess | |||||||||||||
| 1× | 30,875 | 62.6 | 81.1 | 99.45 | – | 0.55 | 51.35 | – | 48.65 | 0.70 | – | 99.30 | |
| 2× | 35,688 | 79.9 | 90.0 | 98.60 | 1.40 | 0.00 | 26.20 | 48.26 | 25.54 | 0.04 | 1.61 | 98.35 | |
| 3× | 28,357 | 88.6 | 94.3 | 98.36 | 1.64 | 0.00 | 13.80 | 72.20 | 14.00 | 0.00 | 2.22 | 97.78 | |
| 4× | 17,849 | 92.7 | 96.4 | 98.05 | 1.95 | 0.00 | 8.22 | 84.01 | 7.77 | 0.00 | 2.63 | 97.37 | |
| 5× | 9619 | 95.3 | 97.6 | 97.40 | 2.60 | 0.00 | 4.55 | 91.52 | 3.93 | 0.00 | 2.49 | 97.51 | |
| 6–10× | 7047 | 95.2 | 97.6 | 97.73 | 2.27 | 0.00 | 4.96 | 91.05 | 4.00 | 0.00 | 2.68 | 97.32 | |
| Overall | 129,435 | 81.4 | 90.7 | 98.53 | 1.34 | 0.13 | 24.27 | 52.40 | 23.33 | 0.18 | 1.61 | 98.21 | |
| Probability ≥ 0.90 | |||||||||||||
| 1× | 0 | – | – | – | – | – | – | – | – | – | – | – | |
| 2–3×^b | 14,572 | 95.5 | 97.7 | – | 100.00 | – | – | 100.00 | – | – | 100.00 | – | |
| 4× | 14,359 | 92.7 | 96.3 | 99.97 | 0.03 | 0.00 | 16.24 | 68.39 | 15.37 | 0.00 | 0.05 | 99.95 | |
| 5× | 8315 | 96.3 | 98.2 | 99.92 | 0.08 | 0.00 | 6.76 | 87.41 | 5.83 | 0.00 | 0.10 | 99.90 | |
| 6–10× | 6397 | 98.1 | 99.0 | 99.83 | 0.17 | 0.00 | 3.08 | 94.65 | 2.27 | 0.00 | 0.29 | 99.71 | |
| Overall | 43,643 | 95.1 | 97.5 | 97.18 | 2.82 | 0.00 | 3.52 | 93.26 | 3.21 | 0.00 | 3.65 | 96.35 | |
| Probability ≥ 0.98 | |||||||||||||
| 1–3× | 0 | – | – | – | – | – | – | – | – | – | – | – | |
| 4–5×^b | 4366 | 99.8 | 99.9 | – | 100.00 | – | – | 100.00 | – | – | 100.00 | – | |
| 6–10× | 6313 | 98.1 | 99.1 | 99.83 | 0.17 | 0.00 | 3.15 | 94.58 | 2.26 | 0.00 | 0.29 | 99.71 | |
| Overall | 10,679 | 98.8 | 99.4 | 99.65 | 0.35 | 0.00 | 1.00 | 98.28 | 0.72 | 0.00 | 0.57 | 99.43 | |
^a Number of genotypes called for 5531 SNPs across 26 individuals
^b Heterozygotes are easier to call than homozygotes; at these coverages, certainty is not sufficient to call the homozygotes. Note, however, that the actual counts for (1|0) and (1|2) are very low compared to (1|1): 19-fold and 569-fold lower for genotypes called with a probability greater than 0.90 and 0.98, respectively
Concordance is shown by coverage at variant site
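To see why only heterozygotes pass the probability thresholds at the lowest coverages, consider a minimal sketch of genotype calling from allele read counts. It assumes a generic binomial model with a flat prior over the three genotypes and a sequencing error rate of 1%; neither the model nor the error rate is taken from the paper, and the function names are hypothetical.

```python
def genotype_probabilities(n_ref, n_alt, error=0.01):
    """Posterior probabilities of genotypes 0, 1 and 2 (copies of the
    alternative allele) given reference and alternative read counts.

    Assumes a flat prior and independent reads with error rate `error`.
    """
    likelihoods = [
        (1 - error) ** n_ref * error ** n_alt,   # homozygous reference
        0.5 ** (n_ref + n_alt),                  # heterozygous
        error ** n_ref * (1 - error) ** n_alt,   # homozygous alternative
    ]
    total = sum(likelihoods)
    return [lik / total for lik in likelihoods]


def call_genotype(n_ref, n_alt, min_prob=0.0, error=0.01):
    """Return the most probable genotype, or None when there are no reads
    or when its probability is below `min_prob`.

    With min_prob=0.0 this is a best-guess call.
    """
    if n_ref + n_alt == 0:
        return None
    probs = genotype_probabilities(n_ref, n_alt, error)
    best = max(range(3), key=probs.__getitem__)
    return best if probs[best] >= min_prob else None


# Three reference reads are not enough for a homozygous call at 0.90,
# whereas one read of each allele already supports a heterozygous call.
print(call_genotype(3, 0, min_prob=0.90), call_genotype(1, 1, min_prob=0.90))
```

Under these assumptions, homozygous calls first pass the 0.90 threshold at 4 reads and the 0.98 threshold at 6 reads, while heterozygous calls pass them at 2 and 4 reads respectively, which agrees with the coverage classes in the table and with footnote b.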
Average allele read counts depending on which allele is in the reference genome
| Allele^a | Reference genome | Allele in reference genome^a | Overall | True genotype = 0 | True genotype = 1 | True genotype = 2 |
|---|---|---|---|---|---|---|
| Reference | Original | Reference | 1.483 | 2.470 | 1.237 | 0.017 |
| Reference | Tailored^b | Alternative | 1.463 | 2.440 | 1.219 | 0.016 |
| Alternative | Original | Reference | 0.980 | 0.014 | 1.217 | 2.407 |
| Alternative | Tailored^b | Alternative | 0.993 | 0.016 | 1.234 | 2.438 |
^a Alleles are defined as reference or alternative allele based on the original pig reference genome Sscrofa11.1 (GenBank assembly accession: GCA_000003025.6)
^b The tailored reference genome was created by replacing the reference allele with the alternative allele at all variant sites discovered across the 26 individuals with the 30× sequence data from chromosome 1
^c Proportion of reads that did not align when the reference genome carried the opposite allele
Impact of bias towards the reference allele due to alignment on concordance between low- and high-coverage sequence data by alignment with the original reference genome (REF), the tailored reference genome (ALT), or a combination of both (CIS and TRANS)
| Alignment | n^a | Genotype concordance (%) | Allele concordance (%) | 0\|0 (true = 0, %) | 1\|0 (true = 0, %) | 2\|0 (true = 0, %) | 0\|1 (true = 1, %) | 1\|1 (true = 1, %) | 2\|1 (true = 1, %) | 0\|2 (true = 2, %) | 1\|2 (true = 2, %) | 2\|2 (true = 2, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Best-guess | ||||||||||||
| REF | 129,435 | 81.4 | 90.7 | 98.53 | 1.34 | 0.13 | 24.27 | 52.40 | 23.33 | 0.18 | 1.61 | 98.21 |
| ALT | 129,327 | 81.4 | 90.7 | 98.36 | 1.49 | 0.14 | 23.66 | 52.42 | 23.92 | 0.17 | 1.47 | 98.36 |
| CIS | 129,610 | 81.5 | 90.7 | 98.37 | 1.49 | 0.14 | 23.82 | 52.73 | 23.45 | 0.17 | 1.62 | 98.21 |
| TRANS | 129,148 | 81.3 | 90.6 | 98.52 | 1.34 | 0.13 | 24.11 | 52.10 | 23.79 | 0.18 | 1.46 | 98.36 |
| Probability ≥ 0.90 | ||||||||||||
| REF | 43,643 | 95.1 | 97.5 | 97.18 | 2.82 | 0.00 | 3.52 | 93.26 | 3.21 | 0.00 | 3.65 | 96.35 |
| ALT | 43,489 | 95.0 | 97.5 | 96.75 | 3.25 | 0.00 | 3.35 | 93.30 | 3.36 | 0.00 | 3.22 | 96.78 |
| CIS | 43,970 | 95.0 | 97.5 | 96.88 | 3.12 | 0.00 | 3.44 | 93.30 | 3.26 | 0.00 | 3.52 | 96.48 |
| TRANS | 43,145 | 95.1 | 97.6 | 97.10 | 2.90 | 0.00 | 3.42 | 93.28 | 3.30 | 0.00 | 3.32 | 96.68 |
| Probability ≥ 0.98 | ||||||||||||
| REF | 10,679 | 98.8 | 99.4 | 99.65 | 0.35 | 0.00 | 1.00 | 98.28 | 0.72 | 0.00 | 0.57 | 99.43 |
| ALT | 10,638 | 98.8 | 99.4 | 99.64 | 0.36 | 0.00 | 0.92 | 98.26 | 0.81 | 0.00 | 0.41 | 99.59 |
| CIS | 10,858 | 98.8 | 99.4 | 99.65 | 0.35 | 0.00 | 0.98 | 98.23 | 0.78 | 0.00 | 0.55 | 99.45 |
| TRANS | 10,463 | 98.8 | 99.4 | 99.64 | 0.36 | 0.00 | 0.94 | 98.31 | 0.75 | 0.00 | 0.43 | 99.57 |
^a Number of genotypes called for 5531 SNPs across 26 individuals
Estimates of index hopping incidence through concordance between low- and high-coverage sequence data in the real and simulated datasets, expressed as percentages and as isometric log-ratios
| Dataset | 0\|0 (true = 0, %) | 1\|0 (true = 0, %) | 2\|0 (true = 0, %) | 0\|1 (true = 1, %) | 1\|1 (true = 1, %) | 2\|1 (true = 1, %) | 0\|2 (true = 2, %) | 1\|2 (true = 2, %) | 2\|2 (true = 2, %) | ilr, 3 parts^a: 0\|0 vs. 1\|0, 2\|0 | ilr, 3 parts^a: 1\|1 vs. 0\|1, 2\|1 | ilr, 3 parts^a: 2\|2 vs. 0\|2, 1\|2 | ilr, 2 parts^b: 0\|0 vs. 1\|0 | ilr, 2 parts^b: 2\|2 vs. 1\|2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Observed | 98.45 | 1.42 | 0.13 | 24.15 | 52.62 | 23.23 | 0.18 | 1.71 | 98.10 | 4.44 | 0.65 | 4.22 | 3.00 | 2.86 |
| Simulated | ||||||||||||||
| 0% | 99.62 | 0.35 | 0.03 | 23.57 | 52.98 | 23.45 | 0.04 | 0.47 | 99.48 | 5.61 | 0.66 | 5.35 | 4.00 | 3.78 |
| 0.1% | 99.53 | 0.44 | 0.03 | 23.59 | 53.52 | 22.89 | 0.08 | 0.52 | 99.40 | 5.47 | 0.68 | 5.06 | 3.83 | 3.72 |
| 0.5% | 99.28 | 0.66 | 0.06 | 23.91 | 53.22 | 22.87 | 0.10 | 0.92 | 98.98 | 5.07 | 0.67 | 4.72 | 3.55 | 3.31 |
| 1% | 98.99 | 0.90 | 0.10 | 23.70 | 53.23 | 23.07 | 0.14 | 1.33 | 98.53 | 4.72 | 0.67 | 4.43 | 3.32 | 3.04 |
| 2% | 98.20 | 1.64 | 0.16 | 23.73 | 52.90 | 23.37 | 0.23 | 2.16 | 97.62 | 4.29 | 0.66 | 4.04 | 2.89 | 2.70 |
| 5% | 96.34 | 3.29 | 0.37 | 23.56 | 53.37 | 23.07 | 0.59 | 4.75 | 94.66 | 3.65 | 0.68 | 3.29 | 2.39 | 2.12 |
| Regression | ||||||||||||||
| R² | 0.999 | 0.998 | 0.999 | 0.014 | 0.044 | 0.014 | 0.989 | 1.000 | 0.999 | 0.993 | 0.213 | 0.981 | 0.995 | 0.989 |
| Estimate | 1.74 | 1.77 | 1.47 | – | – | – | 1.28 | 1.45 | 1.43 | 1.58 | – | 1.48 | 1.67 | 1.46 |
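^a The 3-part isometric log-ratios take the form $\sqrt{2/3}\,\ln\left(p_1 / \sqrt{p_2\,p_3}\right)$, where $p_1$ is the first-listed concordance class and $p_2$ and $p_3$ are the two classes it is compared against
^b The 2-part isometric log-ratios take the form $\sqrt{1/2}\,\ln\left(p_1 / p_2\right)$, where $p_1$ is the first-listed concordance class and $p_2$ is the class it is compared against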
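The log-ratio columns and the regression estimates in the table above can be explored with a short sketch. The isometric log-ratios are computed here with the standard balance formulas, which give the tabulated values for the Observed row to two decimals. For the index hopping estimate, the sketch assumes a simple linear regression of simulated concordance on the simulated hopping level, inverted at the observed value. The excerpt does not describe the authors' regression, so this is only an assumed approximation and the function names are hypothetical.

```python
import math


def ilr3(p1, p2, p3):
    """3-part isometric log-ratio of p1 against the geometric mean of p2 and p3."""
    return math.sqrt(2 / 3) * math.log(p1 / math.sqrt(p2 * p3))


def ilr2(p1, p2):
    """2-part isometric log-ratio of p1 against p2."""
    return math.sqrt(1 / 2) * math.log(p1 / p2)


def inverse_linear_estimate(levels, simulated, observed):
    """Fit simulated = intercept + slope * level by least squares, then
    solve for the level that would produce the observed value."""
    n = len(levels)
    x_mean = sum(levels) / n
    y_mean = sum(simulated) / n
    sxx = sum((x - x_mean) ** 2 for x in levels)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(levels, simulated))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return (observed - intercept) / slope


# Values copied from the table: Observed row and the 0|0 column of the
# simulated rows (index hopping levels in %).
observed_ilr3 = ilr3(98.45, 1.42, 0.13)
observed_ilr2 = ilr2(98.45, 1.42)
levels = [0, 0.1, 0.5, 1, 2, 5]
simulated_00 = [99.62, 99.53, 99.28, 98.99, 98.20, 96.34]
estimate_00 = inverse_linear_estimate(levels, simulated_00, 98.45)
print(round(observed_ilr3, 2), round(observed_ilr2, 2), round(estimate_00, 2))
```

The inverse estimate for the 0|0 column lands close to, but not exactly on, the tabulated 1.74%; the difference may come from rounding of the published percentages or from details of the original regression that are not given here.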
Impact of level of index hopping on concordance between low- and high-coverage sequence data
| Level of index hopping | Genotype concordance (%) | Allele concordance (%) | 0\|0 (true = 0, %) | 1\|0 (true = 0, %) | 2\|0 (true = 0, %) | 0\|1 (true = 1, %) | 1\|1 (true = 1, %) | 2\|1 (true = 1, %) | 0\|2 (true = 2, %) | 1\|2 (true = 2, %) | 2\|2 (true = 2, %) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Best-guess | |||||||||||
| 0% | 82.2 | 91.1 | 99.63 | 0.34 | 0.03 | 23.65 | 52.82 | 23.53 | 0.04 | 0.47 | 99.49 |
| 0.1% | 82.4 | 91.2 | 99.55 | 0.41 | 0.03 | 23.67 | 53.36 | 22.97 | 0.08 | 0.49 | 99.43 |
| 0.5% | 82.2 | 91.1 | 99.30 | 0.64 | 0.06 | 23.97 | 53.08 | 22.96 | 0.10 | 0.89 | 99.01 |
| 1% | 81.9 | 90.9 | 99.03 | 0.87 | 0.10 | 23.77 | 53.09 | 23.14 | 0.14 | 1.28 | 98.58 |
| 2% | 81.3 | 90.6 | 98.25 | 1.59 | 0.16 | 23.79 | 52.78 | 23.43 | 0.23 | 2.07 | 97.70 |
| 5% | 80.1 | 89.9 | 96.45 | 3.17 | 0.37 | 23.63 | 53.23 | 23.14 | 0.59 | 4.58 | 94.82 |
| Probability ≥ 0.90 | |||||||||||
| 0% | 96.9 | 98.5 | 99.20 | 0.80 | 0.00 | 2.46 | 94.98 | 2.55 | 0.00 | 1.16 | 98.84 |
| 0.1% | 96.7 | 98.3 | 99.01 | 0.99 | 0.00 | 2.59 | 94.70 | 2.71 | 0.00 | 1.26 | 98.74 |
| 0.5% | 96.4 | 98.2 | 98.39 | 1.61 | 0.00 | 2.59 | 94.75 | 2.66 | 0.00 | 1.98 | 98.02 |
| 1% | 96.2 | 98.1 | 97.87 | 2.13 | 0.00 | 2.68 | 94.99 | 2.33 | 0.00 | 2.98 | 97.02 |
| 2% | 95.4 | 97.7 | 96.36 | 3.64 | 0.00 | 2.59 | 94.85 | 2.56 | 0.00 | 4.92 | 95.08 |
| 5% | 93.2 | 96.6 | 92.53 | 7.47 | 0.00 | 2.65 | 94.78 | 2.57 | 0.00 | 10.55 | 89.45 |
| Probability ≥ 0.98 | |||||||||||
| 0% | 99.4 | 99.7 | 100.00 | 0.00 | 0.00 | 0.53 | 99.03 | 0.44 | 0.00 | 0.00 | 100.00 |
| 0.1% | 99.3 | 99.6 | 100.00 | 0.00 | 0.00 | 0.63 | 98.85 | 0.53 | 0.00 | 0.00 | 100.00 |
| 0.5% | 99.4 | 99.7 | 99.95 | 0.05 | 0.00 | 0.37 | 99.09 | 0.54 | 0.00 | 0.09 | 99.91 |
| 1% | 99.2 | 99.6 | 99.82 | 0.18 | 0.00 | 0.51 | 98.97 | 0.52 | 0.00 | 0.46 | 99.54 |
| 2% | 99.1 | 99.5 | 99.47 | 0.53 | 0.00 | 0.52 | 98.89 | 0.59 | 0.00 | 0.75 | 99.25 |
| 5% | 98.6 | 99.3 | 98.42 | 1.58 | 0.00 | 0.59 | 98.94 | 0.47 | 0.00 | 2.83 | 97.17 |