Gwenna Breton, Anna C V Johansson, Per Sjödin, Carina M Schlebusch, Mattias Jakobsson.
Abstract
BACKGROUND: Population genetic studies of humans make increasing use of high-throughput sequencing in order to capture diversity in an unbiased way. Sequencing technologies and bioinformatic tools abound, and the number of available genomes is increasing. Studies have evaluated and compared some of these technologies and tools, such as the Genome Analysis Toolkit (GATK) and its "Best Practices" bioinformatic pipelines. However, such studies often focus on a few genomes of Eurasian origin in order to detect technical issues. We instead surveyed the use of the GATK tools and established a pipeline for processing high-coverage full genomes from a diverse set of populations, including Sub-Saharan African groups, in order to reveal challenges from human diversity and stratification.
Keywords: Comparison of pipelines; Genome Analysis Toolkit (GATK); High coverage genomes; High-throughput sequencing (HTS); Next generation sequencing (NGS); Underrepresented ancestry
Year: 2021 PMID: 34627144 PMCID: PMC8502359 DOI: 10.1186/s12859-021-04407-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
List of studies included in the literature survey
| Study | Species | Populations |
|---|---|---|
| [ | Human | Malay |
| [ | Human | Khoe-San |
| [ | Human | Worldwide |
| [ | Human | Dane |
| [ | Human | Icelandic |
| [ | Human | Japanese |
| [ | Human | UK |
| [ | Human | Qatari |
| [ | Human | Chadian, Greek, Lebanese |
| [ | Human | Aboriginal Australian |
| [ | Human | Worldwide |
| [ | Human | Worldwide |
| [ | Human | Swede |
| [ | Human | South African |
| [ | Human | Peruvian |
| [ | Human | Korean |
| [ | Human | Nepalese |
| [ | Human | US, Finn, Estonian |
| [ | Human | Japanese |
| [ | Human | Various African |
| [ | Human | Various African |
| [ | Human | North African, Basque, Iraqi |
| [ | Human | Worldwide |
| [ | Macaque | – |
| [ | Wolf, dog | – |
| [ | Dog | – |
| [ | Dog | – |
| [ | Macaque | – |
| [ | Green monkey | – |
Studies are ordered first by species (Human / other), then by date, and finally by alphabetical order of first author’s last name
Overview of the steps in 29 HTS studies
| Study | Indel realignment | BQSR (BP) | HC (BP) | UG | Other variant caller | VQSR (BP) | Hard filtering |
|---|---|---|---|---|---|---|---|
| [ | No | No | No | No | Yes | No | Yes |
| [ | Yes | No | No | No | Yes | No | Yes |
| [ | Yes | Yes | NA | NA | NA | NA | NA |
| [ | Yes | Yes | Yes | No | No | Yes | No |
| [ | Yes | Yes | Maybe1 | Maybe1 | No | No | Yes |
| [ | No | No | Yes | No | Yes | No | Yes |
| [ | Yes | Yes | No | Yes | Yes | Yes | No |
| [ | Yes?2 | Yes?2 | Maybe1 | Maybe1 | No | No | Yes |
| [ | NA | NA | No | No | Yes | NA | NA |
| [ | Yes | No | No | No | Yes | No | Yes |
| [ | No | No | No | Yes | No | No | No |
| [ | No | No | No | No | Yes | No | No |
| [ | No | Yes | Yes | No | No | Yes | No |
| [ | Yes (other)2 | Yes | Yes | No | Yes | Yes | Yes |
| [ | Yes | Yes | Yes | No | No | No | No |
| [ | Yes (NA)2 | Yes (NA)2 | No | Yes | No | No | Yes |
| [ | Yes | Yes | No | No | Yes | No | Yes |
| [ | No | Yes (other)2 | No | No | Yes | No | Yes |
| [ | No | No | Yes | No | No | Yes | Yes |
| [ | Yes | Yes | Yes | No | No | Yes | Yes |
| [ | No | No | No | Yes | No | No | No |
| [ | Yes | Yes | No | Yes | No | Yes | No |
| [ | Yes | Yes | No | Yes | No | Yes | No |
| [ | No | No | Yes | No | No | Yes | Yes |
| [ | Yes | Yes | No | Yes | No | No | Yes |
| [ | Yes | Yes | No | Yes | No | No | Yes |
| [ | No | Yes | Yes | No | No | Yes | No |
| [ | No | Yes | Yes | No | No | Yes | No |
| [ | Yes | No | Yes | No | No | No | Yes |
| [ | Yes | Yes | Yes | No | No | No | Yes |
BQSR: Base Quality Score Recalibration; HC: HaplotypeCaller + GenotypeGVCFs; VQSR: Variant Quality Score Recalibration; UG: UnifiedGenotyper
BP: GATK tool in the Best Practices in 2019
*Species other than human
#Reports using BQSR, HC + GenotypeGVCFs and VQSR
1: Uncertainty as to which GATK variant caller was used (HC or UG)
2: Various uncertainties or use of alternative software for indel realignment or BQSR
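The overview table can be tallied programmatically. The sketch below is only an illustration: the (BQSR, HC, VQSR) entries are transcribed by hand from the rows above with the footnote digits dropped, and `ROWS` and `uses_full_best_practices` are hypothetical names. It counts a row only when all three entries are an unqualified "Yes", which is one possible reading of the table and not necessarily the authors' own classification.

```python
# (BQSR, HC, VQSR) entries, transcribed row by row from the overview table.
# Footnote digits are dropped; qualified entries ("Yes?", "Yes (other)") are kept as text.
ROWS = [
    ("No", "No", "No"), ("No", "No", "No"), ("Yes", "NA", "NA"),
    ("Yes", "Yes", "Yes"), ("Yes", "Maybe", "No"), ("No", "Yes", "No"),
    ("Yes", "No", "Yes"), ("Yes?", "Maybe", "No"), ("NA", "No", "NA"),
    ("No", "No", "No"), ("No", "No", "No"), ("No", "No", "No"),
    ("Yes", "Yes", "Yes"), ("Yes", "Yes", "Yes"), ("Yes", "Yes", "No"),
    ("Yes (NA)", "No", "No"), ("Yes", "No", "No"), ("Yes (other)", "No", "No"),
    ("No", "Yes", "Yes"), ("Yes", "Yes", "Yes"), ("No", "No", "No"),
    ("Yes", "No", "Yes"), ("Yes", "No", "Yes"), ("No", "Yes", "Yes"),
    ("Yes", "No", "No"), ("Yes", "No", "No"), ("Yes", "Yes", "Yes"),
    ("Yes", "Yes", "Yes"), ("No", "Yes", "No"), ("Yes", "Yes", "No"),
]


def uses_full_best_practices(bqsr, hc, vqsr):
    """Return True only when all three entries are an unqualified 'Yes'."""
    return bqsr == "Yes" and hc == "Yes" and vqsr == "Yes"


# 1-based positions of the table rows that report all three tools.
full_bp_rows = [
    position
    for position, row in enumerate(ROWS, start=1)
    if uses_full_best_practices(*row)
]

print(f"{len(full_bp_rows)} of {len(ROWS)} rows report BQSR, HC and VQSR together")
print("Row positions:", full_bp_rows)
```

Treating qualified entries as "not Yes" is a conservative choice; relaxing the test to `startswith("Yes")` would be an equally defensible variant.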
Fig. 1 Three (plus one) BAM processing and variant calling pipelines. The dashed lines indicate the comparisons mentioned in the text. Ind. = individuals
Metrics in “BP2019”, “BP2015”, and “3mask”, at callset and individual level, before and after VQSR
| Level | Stage | Metric | BP2019 | BP2015 | 3mask |
|---|---|---|---|---|---|
| Callset | Before VQSR | Biallelic SNPs | 20,301,167 | 20,301,911 | 20,312,127 |
| Callset | Before VQSR | Multiallelic SNPs | 85,510 | 85,517 | 85,725 |
| Callset | Before VQSR | Simple indels | 2,599,873 | 2,601,041 | 2,601,657 |
| Callset | Before VQSR | Complex indels | 737,325 | 738,010 | 737,834 |
| Callset | Before VQSR | Singletons | 7,975,044 | 7,974,844 | 7,980,292 |
| Callset | Before VQSR | Biallelic SNPs in dbSNP (%) | 96.68% | 96.68% | 96.67% |
| Callset | Before VQSR | Simple indels in dbSNP (%) | 94.30% | 94.30% | 94.29% |
| Callset | After VQSR | Biallelic SNPs | 19,619,238 | 19,596,831 | 19,591,088 |
| Callset | After VQSR | Multiallelic SNPs | 75,300 | 75,132 | 75,115 |
| Callset | After VQSR | Singletons (SNPs) | 6,930,326 | 6,921,952 | 6,923,568 |
| Callset | After VQSR | Filtered SNPs | 692,139 | 715,465 | 731,649 |
| Callset | After VQSR | Biallelic SNPs in dbSNP (%) | 96.87% | 96.88% | 96.88% |
| Individual | Before VQSR | Biallelic SNPs (average) | 4,443,858.18 | 4,444,067.04 | 4,445,566.93 |
| Individual | Before VQSR | Biallelic SNPs (stdev) | 438,889.45 | 438,945.07 | 438,990.07 |
| Individual | Before VQSR | Biallelic SNPs (min) | 3,442,414 | 3,442,476 | 3,443,131 |
| Individual | Before VQSR | Biallelic SNPs (max) | 4,916,206 | 4,916,439 | 4,917,928 |
| Individual | Before VQSR | Multiallelic SNPs (average) | 32,357.82 | 32,354.79 | 32,414.75 |
| Individual | Before VQSR | Multiallelic SNPs (stdev) | 3,943.62 | 3,943.33 | 3,961.22 |
| Individual | Before VQSR | Simple indels (average) | 508,131.89 | 508,474.93 | 508,597.68 |
| Individual | Before VQSR | Simple indels (stdev) | 46,623.55 | 46,488.14 | 46,465.10 |
| Individual | Before VQSR | Complex indels (average) | 346,817.75 | 347,534.14 | 347,547.21 |
| Individual | Before VQSR | Complex indels (stdev) | 25,452.05 | 25,499.59 | 25,401.04 |
| Individual | Before VQSR | Singletons (average) | 284,823.00 | 284,815.86 | 285,010.43 |
| Individual | Before VQSR | Singletons (stdev) | 69,889.22 | 69,906.60 | 69,862.20 |
| Individual | After VQSR | Biallelic SNPs (average) | 4,302,149.93 | 4,301,534.07 | 4,299,853.14 |
| Individual | After VQSR | Biallelic SNPs (stdev) | 429,408.28 | 428,949.56 | 428,861.99 |
| Individual | After VQSR | Biallelic SNPs (min) | 3,393,736 | 3,394,064 | 3,393,934 |
| Individual | After VQSR | Biallelic SNPs (max) | 4,744,903 | 4,743,518 | 4,741,343 |
| Individual | After VQSR | Multiallelic SNPs (average) | 28,690.14 | 28,664.71 | 28,645.43 |
| Individual | After VQSR | Multiallelic SNPs (stdev) | 3,028.99 | 3,022.11 | 3,019.16 |
| Individual | After VQSR | Singletons (SNPs, average) | 247,511.64 | 247,212.57 | 247,270.29 |
| Individual | After VQSR | Singletons (SNPs, stdev) | 62,155.76 | 61,953.47 | 61,967.99 |
| Individual | After VQSR | Filtered SNPs (average) | 145,375.93 | 146,223.04 | 149,483.11 |
| Individual | After VQSR | Filtered SNPs (stdev) | 46,580.10 | 46,955.26 | 48,246.66 |
Only SNPs are considered after VQSR
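The callset-level counts in the table above can be cross-checked against each other. The following sketch, with hypothetical names `CALLSET` and `snps_removed`, assumes that "Filtered SNPs" means the bi- and multiallelic SNPs present before VQSR but absent afterwards. That reading is an assumption, not something the table states. All figures are copied from the table.

```python
# Callset-level SNP counts copied from the table above.
# "before"/"after" hold (biallelic SNPs, multiallelic SNPs).
CALLSET = {
    "BP2019": {"before": (20_301_167, 85_510), "after": (19_619_238, 75_300), "filtered": 692_139},
    "BP2015": {"before": (20_301_911, 85_517), "after": (19_596_831, 75_132), "filtered": 715_465},
    "3mask": {"before": (20_312_127, 85_725), "after": (19_591_088, 75_115), "filtered": 731_649},
}


def snps_removed(entry):
    """SNPs (bi- plus multiallelic) present before VQSR but not after."""
    return sum(entry["before"]) - sum(entry["after"])


for name, entry in CALLSET.items():
    removed = snps_removed(entry)
    share = 100 * removed / sum(entry["before"])
    print(
        f"{name}: {removed:,} SNPs removed by VQSR "
        f"({share:.2f}% of pre-VQSR SNPs); table reports {entry['filtered']:,}"
    )
```

Under that assumption the computed difference agrees with the reported "Filtered SNPs" value for each of the three pipelines.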
Fig. 2 Most variants are common to “BP2019”, “BP2015” and “3mask”. Venn diagrams of the variants in “BP2019”, “BP2015” and “3mask” before VQSR (A and B) and after VQSR (C and D). The diagrams are not to scale. The percentages in parentheses represent the percentage of all variants combined which are in the intersection. A All variant sites, before VQSR. B Biallelic SNPs, before VQSR. C All variant sites, after VQSR. D Biallelic SNPs, after VQSR
Fig. 3 Differences in number of SNPs per individual are explained by dataset rather than ancestry. Boxplots of the difference between the number of SNPs per individual in “3maskvqsred” and “BP2019vqsred”, as a percentage of “BP2019vqsred” (a negative percentage indicates more variants in “BP2019vqsred”). A Individuals are grouped by ancestry. B Individuals are grouped by dataset
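The quantity plotted in Fig. 3 is a signed percentage difference relative to “BP2019vqsred”. A minimal sketch of that calculation follows; `percent_difference` is a hypothetical helper. Because per-individual counts are not listed here, the example substitutes the post-VQSR average biallelic SNP counts from the metrics table, so it only illustrates the formula and does not reproduce any point in the figure.

```python
def percent_difference(n_3mask, n_bp2019):
    """Difference in SNP count, expressed as a percentage of the BP2019 count.

    A negative value means more SNPs in the BP2019 callset.
    """
    return 100.0 * (n_3mask - n_bp2019) / n_bp2019


# Illustration only: post-VQSR average biallelic SNPs per individual,
# taken from the metrics table (3mask vs BP2019).
average_difference = percent_difference(4_299_853.14, 4_302_149.93)
print(f"{average_difference:.3f}% relative to BP2019")
```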
Metrics in “3mask” and “3mask + 28”, at callset and individual level, before and after VQSR
| Level | Stage | Metric | 3mask | 3mask + 28 |
|---|---|---|---|---|
| Callset | Before VQSR | Biallelic SNPs | 20,312,127 | 20,434,008 |
| Callset | Before VQSR | Multiallelic SNPs | 85,725 | 94,044 |
| Callset | Before VQSR | Simple indels | 2,601,657 | 2,564,122 |
| Callset | Before VQSR | Complex indels | 737,834 | 816,453 |
| Callset | Before VQSR | Singletons | 7,980,292 | 8,123,791 |
| Callset | Before VQSR | Biallelic SNPs in dbSNP (%) | 96.67% | 96.67% |
| Callset | Before VQSR | Simple indels in dbSNP (%) | 94.29% | 94.21% |
| Callset | After VQSR | Biallelic SNPs | 19,591,088 | 19,544,864 |
| Callset | After VQSR | Multiallelic SNPs | 75,115 | 79,945 |
| Callset | After VQSR | Singletons (SNPs) | 6,923,568 | 6,902,425 |
| Callset | After VQSR | Filtered SNPs | 731,649 | 903,243 |
| Callset | After VQSR | Biallelic SNPs in dbSNP (%) | 96.88% | 96.96% |
| Individual | Before VQSR | Biallelic SNPs (average) | 4,445,566.93 | 4,448,252.21 |
| Individual | Before VQSR | Biallelic SNPs (stdev) | 438,990.07 | 440,005.98 |
| Individual | Before VQSR | Biallelic SNPs (min) | 3,443,131.00 | 3,442,837.00 |
| Individual | Before VQSR | Biallelic SNPs (max) | 4,917,928.00 | 4,922,667.00 |
| Individual | Before VQSR | Multiallelic SNPs (average) | 32,414.75 | 33,932.54 |
| Individual | Before VQSR | Multiallelic SNPs (stdev) | 3,961.22 | 4,451.22 |
| Individual | Before VQSR | Simple indels (average) | 508,597.68 | 490,154.68 |
| Individual | Before VQSR | Simple indels (stdev) | 46,465.10 | 46,446.32 |
| Individual | Before VQSR | Complex indels (average) | 347,547.21 | 370,585.18 |
| Individual | Before VQSR | Complex indels (stdev) | 25,401.04 | 30,851.51 |
| Individual | Before VQSR | Singletons (average) | 285,010.43 | 290,135.39 |
| Individual | Before VQSR | Singletons (stdev) | 69,862.20 | 70,494.15 |
| Individual | After VQSR | Biallelic SNPs (average) | 4,299,853.14 | 4,305,202.11 |
| Individual | After VQSR | Biallelic SNPs (stdev) | 428,861.99 | 430,653.45 |
| Individual | After VQSR | Biallelic SNPs (min) | 3,393,934.00 | 3,394,319.00 |
| Individual | After VQSR | Biallelic SNPs (max) | 4,741,343.00 | 4,749,854.00 |
| Individual | After VQSR | Multiallelic SNPs (average) | 28,645.43 | 29,668.04 |
| Individual | After VQSR | Multiallelic SNPs (stdev) | 3,019.16 | 3,267.90 |
| Individual | After VQSR | Singletons (SNPs, average) | 247,270.29 | 246,515.18 |
| Individual | After VQSR | Singletons (SNPs, stdev) | 61,967.99 | 62,371.91 |
| Individual | After VQSR | Filtered SNPs (average) | 149,483.11 | 147,314.61 |
| Individual | After VQSR | Filtered SNPs (stdev) | 48,246.66 | 48,020.13 |
Only SNPs are considered after VQSR. Metric names in italics were calculated by the authors (i.e. they are not an output of Picard’s CollectVariantCallingMetrics)
Fig. 4 The total number of SNPs by individual is a function of coverage and ancestry. Total number of SNPs (bi- and multiallelic) per individual in “BP2019vqsred”. The y-axis starts at 3,400,000 SNPs. A Coloured by ancestry (the dots from a given ancestry are connected by lines). B Coloured by dataset
Fig. 5 The total number of SNPs by individual in a larger dataset. Total number of SNPs (bi- and multiallelic) per individual in “3mask + vqsred”. The y-axis starts at 3,100,000 SNPs. Dots are coloured according to groups (the dots from a given group are connected by lines). The non-hunter-gatherer Sub-Saharan Africans are not shown as it is a very diverse group with respect to ancestry. RHG = Rainforest hunter-gatherers