Camille Clouard, Kristiina Ausmees, Carl Nettelblad.
Abstract
BACKGROUND: Despite continuing technological advances, the cost of large-scale genotyping of a high number of samples can be prohibitive. The purpose of this study is to design a cost-saving strategy for SNP genotyping. We suggest making use of pooling, a group testing technique, to reduce the number of SNP arrays needed. We believe that this will be of the greatest importance for non-model organisms with more limited resources in terms of cost-efficient large-scale chips and high-quality reference genomes, for example in wildlife monitoring and in plant and animal breeding, but the approach is in essence species-agnostic. The proposed approach consists of grouping and mixing individual DNA samples into pools before testing these pools on bead-chips, such that the number of pools is less than the number of individual samples. We present a statistical estimation algorithm, based on the pooling outcomes, for inferring marker-wise the most likely genotype of every sample in each pool. Finally, we input these estimated genotypes into existing imputation algorithms. We compare the imputation performance from pooled data with the Beagle algorithm, and a local likelihood-aware phasing algorithm closely modeled on MaCH that we implemented.
Keywords: Genotyping; Imputation; Pooling
Year: 2022 PMID: 36229780 PMCID: PMC9563787 DOI: 10.1186/s12859-022-04974-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Fig. 2 Marker data sets used for the study population in the pooling and classical imputation scenarios. a LD and HD marker data sets from intersecting Illumina bead-chips x 1KGP chromosome 20. b Distribution and values of missing genotypes in a classical imputation scenario (1.) and in an imputation scenario from pooled data (2.), where the genotype probabilities are estimated from the configurations of the pooling blocks
Fig. 3 Examples of genotype pooling simulation at the block level. a Configuration with 1 sample carrying the minor allele. This carrier is identified after pooling, but whether its genotype is heterozygous (1) or minor homozygous (2) is not. b Configuration with 2 samples carrying the minor allele. At least 2 of the 4 samples highlighted in grey are minor allele carriers, but the genotypes of these 4 samples are indeterminate. The first step is encoding and pooling. Encoding assigns every sample to a pool and defines its pool coordinates. For instance in a, the sample at the top-left corner of the matrix has coordinates (1, 5). Pooling computes the genotype of a pool as if it were tested on a SNP-chip. Pool 5 (P5, leftmost) has genotype 1: both alleles 0 and 1 are detected among the samples. Pool 1 has genotype 0 because only the allele 0 is detected. The decoding step infers the pooled genotype of each sample from the genotypes of its coordinates. The genotype is indeterminate when both coordinates have genotype 1, and fully determined otherwise. In subfigure 3a, the sample with coordinates (3, 5) carries the alternate allele, but there can be 1 or 2 copies of it. The observed pooling pattern that results from grouped genotyping is given as the number of row- and column-pools having the genotypes (0, 1, 2). In example 3a, there are 3 row-pools having genotype 0, 1 row-pool having genotype 1 and 0 having genotype 2, and likewise for the column-pools. c and d Simulation example of genotype pooling and imputation outcomes for markers from the 1KGP data (chromosome 20). The genotypes are represented as unphased GT. From top to bottom: true genotype data, pooled genotypes, imputed genotypes. c SNP 20:264365. d SNP 20:62915126
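The encoding, pooling and decoding steps described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: genotypes are coded as minor-allele counts (0, 1, 2), the block is a 4 x 4 matrix whose rows and columns are the pools, -1 is a placeholder chosen here for an indeterminate genotype, and the function names `pool_block`, `decode_block` and `pooling_pattern` are hypothetical. The statistical estimation of genotype probabilities mentioned in the abstract is not covered.

```python
import numpy as np

INDETERMINATE = -1  # placeholder chosen for this sketch, not from the paper


def pool_block(block):
    """Simulate the genotype of each row-pool and column-pool of a block.

    A pool has genotype 0 (or 2) only if every sample in it is 0 (or 2);
    as soon as both alleles are present the pool reads as 1.
    """
    def pool(genotypes):
        if np.all(genotypes == 0):
            return 0
        if np.all(genotypes == 2):
            return 2
        return 1

    rows = np.array([pool(r) for r in block])
    cols = np.array([pool(c) for c in block.T])
    return rows, cols


def decode_block(rows, cols):
    """Infer each sample's genotype from the genotypes of its two pools."""
    decoded = np.full((len(rows), len(cols)), INDETERMINATE)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            if r == 0 or c == 0:
                decoded[i, j] = 0  # one homozygous-major pool settles it
            elif r == 2 or c == 2:
                decoded[i, j] = 2  # one homozygous-minor pool settles it
            # both pools have genotype 1: the sample stays indeterminate
    return decoded


def pooling_pattern(rows, cols):
    """Count the row-pools and column-pools having genotypes (0, 1, 2)."""
    count = lambda pools: tuple(int(np.sum(pools == g)) for g in (0, 1, 2))
    return count(rows), count(cols)


# One carrier at row-pool 3 and the first column-pool (0-based index [2, 0]).
block = np.zeros((4, 4), dtype=int)
block[2, 0] = 1
rows, cols = pool_block(block)
decoded = decode_block(rows, cols)
pattern = pooling_pattern(rows, cols)
```

In this sketch 16 samples are covered by 8 pools, and the single carrier is located but its genotype remains indeterminate, which is the situation that the caption describes for subfigure a.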
Fig. 1 Experimental steps for creating the data sets in the pooling and classical imputation scenarios. The original data set “Chr20 x OmniExpress” consists of the genotype data of 2504 samples at 52,697 SNPs. The set of markers is created by intersecting the variants present on both bead-chips from the Illumina manufacturer and the data for chromosome 20 in the 1KGP. The original data set is randomly split into a reference panel and a study population. In the LD + HD scenario, all markers in the HD data set that are not present in the LD data set are filtered out in the study population. In the pooled HD scenario, the study samples are first assigned to blocks and pools, then the pools are genotyped at all markers in the HD data set, and finally the genotype of each sample is decoded from the pools at every marker. See Fig. 3 for an example of the simulation steps in 1 block at 1 marker. The imputation step is performed in both scenarios from the reference panel, with Beagle on the one hand and Prophaser on the other hand. The genotyping accuracy in each scenario is computed by comparing the imputed genotypes with the true ones in the original data set for the study population
Marker counts and proportions on the LD and the HD maps per MAF bin
| MAF | 0.00–0.02 | 0.02–0.04 | 0.04–0.06 | 0.06–0.10 | 0.10–0.20 | 0.20–0.40 | 0.40–0.50 | Total |
|---|---|---|---|---|---|---|---|---|
| LD map (counts) | 520 | 779 | 673 | 1537 | 3969 | 6561 | 2976 | 17015 |
| HD map (counts) | 12775 | 5235 | 2823 | 4766 | 9009 | 12613 | 5476 | 52697 |
| LD map (%) | 0.987 | 1.478 | 1.277 | 2.917 | 7.532 | 12.450 | 5.647 | 32.288 |
| HD map (%) | 24.242 | 9.934 | 5.357 | 9.044 | 17.096 | 23.935 | 10.392 | 100 |
The counts are given in the first two rows of the table, the proportions in the last two. The proportions are given relative to the total number of SNPs on the HD map. Overall, the HD map is 3 times denser than the LD map, but the density does not increase uniformly over the MAF bins. Almost 25% of the markers on the HD map are very rare variants (MAF below 0.02), which is 25 times denser than on the LD map, where they represent less than 1% of the markers
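The proportions and density ratios quoted for this table can be checked from the counts alone. The short sketch below recomputes them; the variable names are chosen here for illustration.

```python
# Marker counts per MAF bin, copied from the table
# (bins 0.00-0.02, 0.02-0.04, ..., 0.40-0.50).
ld_counts = [520, 779, 673, 1537, 3969, 6561, 2976]
hd_counts = [12775, 5235, 2823, 4766, 9009, 12613, 5476]

ld_total = sum(ld_counts)
hd_total = sum(hd_counts)

# Both maps are expressed relative to the total number of SNPs on the HD map.
ld_pct = [100 * c / hd_total for c in ld_counts]
hd_pct = [100 * c / hd_total for c in hd_counts]

# Overall density ratio, and ratio within the very rare bin (first bin).
density_ratio = hd_total / ld_total
rare_ratio = hd_counts[0] / ld_counts[0]

print(f"HD/LD overall: {density_ratio:.2f}, HD/LD very rare bin: {rare_ratio:.1f}")
```

The overall ratio comes out slightly above 3, while the ratio within the first bin is close to 25, which is the non-uniform increase in density that the note points out.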
Exact genotypes in markers per data MAF bin
| MAF | 0.00–0.02 | 0.02–0.04 | 0.04–0.06 | 0.06–0.10 | 0.10–0.20 | 0.20–0.40 | 0.40–0.50 |
|---|---|---|---|---|---|---|---|
| LD + HD: number before imputation | 520.000 | 779.000 | 673.000 | 1537.000 | 3969.000 | 6561.000 | 2976.000 |
| LD + HD: number after Beagle | 12699.362 | 5167.613 | 2776.687 | 4673.658 | 8804.892 | 12301.371 | 5337.921 |
| LD + HD: number after Prophaser | 12727.142 | 5193.438 | 2793.221 | 4705.346 | 8870.104 | 12396.258 | 5379.408 |
| LD + HD: proportion before imputation | 0.041 | 0.149 | 0.238 | 0.322 | 0.441 | 0.520 | 0.543 |
| LD + HD: proportion after Beagle | 0.994 | 0.987 | 0.984 | 0.981 | 0.977 | 0.975 | 0.975 |
| LD + HD: proportion after Prophaser | **0.996** | 0.992 | 0.989 | 0.987 | 0.985 | 0.983 | 0.982 |
| Pooled HD: number before imputation | 12534.608 | 4826.542 | 2396.671 | 3481.896 | 4249.592 | 1853.529 | 159.575 |
| Pooled HD: number after Beagle | 12565.650 | 4892.246 | 2478.292 | 3778.296 | 5637.525 | 5407.479 | 1941.162 |
| Pooled HD: number after Prophaser | 12755.854 | 5184.621 | 2758.079 | 4532.467 | 7964.742 | 9858.467 | 4012.725 |
| Pooled HD: proportion before imputation | 0.981 | 0.922 | 0.849 | 0.731 | 0.472 | 0.147 | 0.029 |
| Pooled HD: proportion after Beagle | 0.984 | 0.935 | 0.878 | 0.793 | 0.626 | 0.429 | 0.354 |
| Pooled HD: proportion after Prophaser | **0.999** | 0.990 | 0.977 | 0.951 | 0.884 | 0.782 | 0.733 |
The number of markers is given as the average over all samples in the study population per bin. The proportion of markers is given relative to the number of markers per bin. Unlike concordance, only full matches with the true genotype are counted, not half-matches. For the LD + HD scenario, the number of exact genotypes before imputation is equal to the number of variants on the LD map. For the pooled HD scenario, the number of exact genotypes before imputation is equal to the average number of genotypes that are fully determined after pooling simulation. Simulating pooling followed by imputation with Prophaser yields a gain in accuracy for the very rare variants (MAF below 0.02), which are almost all exactly genotyped. This gain is not negligible given the low occurrence of these variants
The best accuracy scores achieved by Prophaser are marked in bold
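The difference between counting only full matches and also crediting half-matches can be illustrated with a small sketch. Everything here is an assumption made for illustration: genotypes are coded as minor-allele counts (0, 1, 2), the function names are hypothetical, and the allele-level score below is one possible way of crediting half-matches, not necessarily the concordance measure used in the study.

```python
import numpy as np


def exact_match_rate(true_gt, imputed_gt):
    """Proportion of genotypes that fully match the true genotype."""
    true_gt = np.asarray(true_gt)
    imputed_gt = np.asarray(imputed_gt)
    return float(np.mean(true_gt == imputed_gt))


def allele_sharing_rate(true_gt, imputed_gt):
    """Assumed half-match score: each wrong allele costs half a genotype."""
    diff = np.abs(np.asarray(true_gt) - np.asarray(imputed_gt))
    return float(1.0 - np.mean(diff) / 2.0)


# Toy data: two full matches and two genotypes that are off by one allele.
true_gt = [0, 1, 2, 0]
imputed_gt = [0, 2, 2, 1]
exact = exact_match_rate(true_gt, imputed_gt)
shared = allele_sharing_rate(true_gt, imputed_gt)
```

On this toy input the exact-match rate is 0.5 while the allele-level score is 0.75, so a metric that credits half-matches is always at least as high as the exact-match proportion reported in the table.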
Fig. 4 Decoded and missing genotypes in data for both imputation scenarios. The minor and major alleles are denoted m and M. For simplicity, the simulated decoded genotypes from pooling are represented in GT format. Recall that adaptive GL are provided later in the experiment for running imputation on data informed with the pooling outcomes. Half-decoded (GT = M/. or ./m) and not decoded (GT = ./.) genotypes are considered as missing data. The relative genotype proportions are scaled in [0, 1] within each bin. a Only the markers in the LD data set are fully assayed; all other markers have been deleted. b True heterozygous genotypes (dark blue) are never fully decoded, whereas the rare variants are almost all fully decoded or have at least one of the two alleles determined
Proportion of exact genotypes after imputation for indeterminate data in the pooled HD scenario per data MAF bin
| MAF | 0.00–0.02 | 0.02–0.04 | 0.04–0.06 | 0.06–0.10 | 0.10–0.20 | 0.20–0.40 | 0.40–0.50 |
|---|---|---|---|---|---|---|---|
| Prophaser | | 0.886214 | 0.849634 | 0.820339 | 0.783430 | 0.745528 | 0.724745 |
| Beagle | 0.124773 | 0.156686 | 0.187206 | 0.227121 | 0.287044 | 0.329487 | 0.334919 |
This table focuses on the genotypes that are indeterminate after the pooling simulation. The proportion is calculated for these markers only and relative to the number of markers in the bin. For the very rare variants (MAF below 0.02), the indeterminate genotypes are the rare allele carriers. Prophaser succeeds in imputing most of them exactly from the provided prior estimates of the genotype probabilities
The best accuracy scores achieved by Prophaser are marked in bold
Fig. 5 Genotype imputation accuracy in a classical and a pooled scenario. a and b concordance (based on best-guess genotype), c and d cross-entropy (based on posterior genotype probabilities) metrics. All markers from the HD map have been used for computing the metrics (52,697 markers). Beagle (labeled as “beagle”) performance is in blue, and Prophaser (labeled as “phaser”) in orange. The central line is the median and the shadowed areas delimit the percentiles 0.0, 0.01, 0.25, 0.75, 0.99, 1.0. The x-axis was built from 0.05-long MAF bins within which each marker concordance score was computed as the mean score of the 500 previous and 500 next markers sorted by ascending MAF
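The smoothing described at the end of this caption, where each marker's score is averaged over its neighbours once the markers are sorted by MAF, can be sketched as below. This is an illustration under assumptions rather than the plotting code of the study: the marker itself is included in its window, the window is truncated at both ends of the sorted list, and the function name `smooth_by_maf` is hypothetical.

```python
import numpy as np


def smooth_by_maf(maf, score, half_window=500):
    """Average each marker's score over its neighbours in MAF order.

    Returns the sorted MAF values and the smoothed scores in that order.
    Assumption: the window covers up to `half_window` markers on each side
    plus the marker itself, and shrinks near the ends of the list.
    """
    maf = np.asarray(maf, dtype=float)
    score = np.asarray(score, dtype=float)
    order = np.argsort(maf, kind="stable")
    sorted_score = score[order]

    # Prefix sums give every window sum in constant time.
    prefix = np.concatenate(([0.0], np.cumsum(sorted_score)))
    n = len(sorted_score)
    idx = np.arange(n)
    lo = np.maximum(idx - half_window, 0)
    hi = np.minimum(idx + half_window + 1, n)
    smoothed = (prefix[hi] - prefix[lo]) / (hi - lo)
    return maf[order], smoothed


# Tiny example with one neighbour on each side instead of 500.
sorted_maf, smoothed = smooth_by_maf([0.3, 0.1, 0.2], [3.0, 1.0, 2.0], half_window=1)
```

With 52,697 markers and `half_window=500`, each interior point would average 1001 scores. Excluding the marker itself from its own window would be an equally plausible reading of the caption.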
| | P | P | P | P |
|---|---|---|---|---|
| P | G | G | G | G |
| P | G | G | G | G |
| P | G | G | G | G |
| P | G | G | G | G |