Mario P L Calus, Jérémie Vandenplas.
Abstract
BACKGROUND: High levels of pairwise linkage disequilibrium (LD) in single nucleotide polymorphism (SNP) array or whole-genome sequence data may affect both performance and efficiency of genomic prediction models. Thus, this warrants pruning of genotyping data for high LD. We developed an algorithm, named SNPrune, which enables the rapid detection of any pair of SNPs in complete or high LD throughout the genome.
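To make the pruning task concrete, here is a minimal sketch of a naive all-pairs LD pruner. It is not the SNPrune algorithm: it simply compares every SNP with every SNP already kept. It assumes that LD is measured as the squared Pearson correlation between allele-count vectors and that a pair counts as high LD above 0.99 (the threshold used in the tables below); the function names and the toy genotype matrix are hypothetical.

```python
import numpy as np


def r2(x, y):
    """Squared Pearson correlation between two allele-count vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = (xc @ xc) * (yc @ yc)
    if denom == 0.0:  # monomorphic SNP: correlation is undefined, treat as 0
        return 0.0
    return float((xc @ yc) ** 2 / denom)


def naive_prune(genotypes, threshold=0.99):
    """Greedy all-pairs pruning; returns the column indices of SNPs kept.

    genotypes: (n_individuals, n_snps) array of allele counts (0, 1 or 2).
    A SNP is dropped when its r2 with any previously kept SNP exceeds
    the threshold. Cost grows quadratically with the number of SNPs.
    """
    keep = []
    for j in range(genotypes.shape[1]):
        if all(r2(genotypes[:, j], genotypes[:, k]) <= threshold for k in keep):
            keep.append(j)
    return keep


# Toy data (assumed): SNP 1 duplicates SNP 0, SNP 2 is its mirror image
# (2 - SNP 0), and SNP 3 is nearly uncorrelated with SNP 0.
geno = np.array([
    [0, 0, 2, 1],
    [1, 1, 1, 0],
    [2, 2, 0, 2],
    [1, 1, 1, 1],
    [0, 0, 2, 2],
])
kept = naive_prune(geno)
print(kept)
```

The quadratic number of comparisons in this naive version is the cost that motivates a faster approach on array or sequence data.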
Year: 2018 PMID: 29940846 PMCID: PMC6019535 DOI: 10.1186/s12711-018-0404-z
Source DB: PubMed Journal: Genet Sel Evol ISSN: 0999-193X Impact factor: 4.297
Fig. 1 Relationship between expected maximum values for r values computed based on allele counts or phased alleles. Pairs of values are indicated by black dots. The red line is shown as a reference
Number of pruned SNPs from the 52,843 SNPs present on the 60k pig SNP array
| Analysis | One stepᵃ: total | One stepᵃ, pre-sorted on MAF: total | Two stepᵃ, pre-sorted on MAF: high LD | Two stepᵃ, pre-sorted on MAF: totalᵇ |
|---|---|---|---|---|
| SNPrune | 9126 | 9126 | 5296 | 9126 |
| PLINK | 9038 | 9098 | 5283 | 9113 |
| Overlapᶜ | 6792 | 6844 | 4939 | 8778 |

ᵃ Either the LD pruning is done in one step, or in two steps, where 3830 SNPs in complete LD with other SNPs are removed in the first step, and the remaining SNPs in high LD are removed in the second step
ᵇ Including the 3830 SNPs removed due to complete LD
ᶜ Overlap between SNPs pruned by SNPrune and PLINK
Numbers of SNP pairs for which r values were computed for the pig data
| Number of pairs of SNPs | Allele counts |
|---|---|
| Possibly > 0.99 | 9,898,092 |
| < 0.99 (partial sums) | 6,179,862 |
| Computed | 3,718,230 |
| Percentage (all SNPs)ᵃ | 0.27 |
| Percentage (after pruning for complete LD)ᵇ | 0.31 |

ᵃ Number of computed values as a percentage of the total number of r values
ᵇ Number of computed values as a percentage of the total number of r values after pruning for complete LD
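The two percentages can be checked with a few lines of arithmetic. The sketch below assumes that the total number of r values is the number of distinct SNP pairs, n(n − 1)/2, with n = 52,843 (the array size given earlier) and n − 3830 after removing the SNPs in complete LD; the variable names are illustrative.

```python
from math import comb

n_snps = 52_843             # SNPs on the 60k pig array
n_complete_ld = 3_830       # SNPs removed for complete LD in the first step
possibly_high = 9_898_092   # pairs whose value could exceed 0.99
below_partial = 6_179_862   # pairs shown to be < 0.99 from partial sums

# Pairs for which the full value actually had to be computed
computed = possibly_high - below_partial

# Assumed denominators: all distinct SNP pairs, before and after pruning
pairs_all = comb(n_snps, 2)
pairs_after = comb(n_snps - n_complete_ld, 2)

pct_all = 100 * computed / pairs_all
pct_after = 100 * computed / pairs_after
print(computed, round(pct_all, 2), round(pct_after, 2))
```

Under these assumptions the script reproduces the "Computed" row and both percentage rows of the table.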
Number of pruned SNPs from the 10,812,225 SNPs included in the simulated sequence dataset
| Pruning approach | SNPs pruned out: complete LD | SNPs pruned out: high LD | SNPs pruned out: total | SNPs left |
|---|---|---|---|---|
| SNPrune allele counts | 6,367,210 | 1,428,122 | 7,795,332 | 3,016,893 |
| SNPrune phased alleles | 6,366,971 | 1,428,725 | 7,795,696 | 3,016,529 |
| PLINK (w50ᵃ) NPᵇ | NA | NA | 5,401,197 | 5,411,028 |
| PLINK (w500ᵃ) NPᵇ | NA | NA | 7,547,118 | 3,265,107 |
| PLINK (w5000ᵃ) NPᵇ | NA | NA | 7,740,937 | 3,071,288 |
| PLINK (w50000ᵃ) NPᵇ | NA | NA | 7,750,558 | 3,061,667 |
| PLINK (w500000ᵃ) NPᵇ | NA | NA | 7,752,008 | 3,060,217 |
| PLINK (w5000000ᵃ) NPᵇ | NA | NA | 7,752,834 | 3,059,391 |
| PLINK (w50ᵃ) MLPᶜ | NA | NA | 5,401,527 | 5,410,698 |
| PLINK (w500ᵃ) MLPᶜ | NA | NA | 7,543,234 | 3,268,991 |
| PLINK (w5000ᵃ) MLPᶜ | NA | NA | 7,741,279 | 3,070,946 |
| PLINK (w50000ᵃ) MLPᶜ | NA | NA | 7,751,008 | 3,061,217 |
| PLINK (w500000ᵃ) MLPᶜ | NA | NA | 7,752,485 | 3,059,740 |

ᵃ Using a sliding window of 50, 500, 5000, 50,000, 500,000 or 5,000,000 SNPs
ᵇ The values are computed between allele counts, considering no phasing (NP)
ᶜ The values are computed between alleles that are phased based on maximum likelihood phasing (MLP)
Number of computed r values in the simulated sequence dataset using PLINK
| Window size | Step size | Number of computed r valuesᵃ |
|---|---|---|
| 50 | 5 | 2.65 × 10⁹ |
| 500 | 50 | 2.70 × 10¹⁰ |
| 5000 | 500 | 2.70 × 10¹¹ |
| 50,000 | 5000 | 2.70 × 10¹² |
| 500,000 | 50,000 | 2.70 × 10¹³ |
| 5,000,000 | 500,000 | 2.70 × 10¹⁴ |

ᵃ Computed as $\frac{10{,}812{,}225}{s} \times \frac{w(w-1)}{2}$, where 10,812,225 is the total number of SNPs, $s$ is the step size (i.e. the size of the shift of the windows), and $w$ is the window size used
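A minimal sketch that reproduces the counts in this table, assuming each window of w SNPs contributes w(w − 1)/2 pairs and that there are 10,812,225 / s windows for a step size s. This reading is inferred from the tabulated values; the function name `plink_pairs` is hypothetical, and the window sizes follow the list of PLINK settings given above (50 to 5,000,000 SNPs, each with a step of one tenth of the window).

```python
N_SNPS = 10_812_225  # SNPs in the simulated sequence dataset


def plink_pairs(window, step, n_snps=N_SNPS):
    """Approximate number of pairwise values for a sliding-window scan.

    Assumes n_snps / step windows, each with window * (window - 1) / 2
    SNP pairs; pairs shared by overlapping windows are counted repeatedly.
    """
    n_windows = n_snps / step
    pairs_per_window = window * (window - 1) / 2
    return n_windows * pairs_per_window


settings = [
    (50, 5),
    (500, 50),
    (5_000, 500),
    (50_000, 5_000),
    (500_000, 50_000),
    (5_000_000, 500_000),
]
counts = {window: plink_pairs(window, step) for window, step in settings}

for window, count in counts.items():
    print(f"{window:>9,d}  {count:.2e}")
```

Because the step is a fixed fraction of the window, the count grows roughly tenfold with each tenfold increase in window size.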
Number of pairs of SNPs for which r values were computed for the simulated sequence dataset
| Number of pairs of SNPs | Phased alleles | Allele counts |
|---|---|---|
| Possibly > 0.99 | 107,576,540,902 | 107,567,702,834 |
| < 0.99 (partial sums) | 61,142,300,573 | 61,152,664,161 |
| Computed | 46,434,240,329 | 46,415,038,673 |
| Percentage (all SNPs)ᵃ | 0.08 | 0.08 |
| Percentage (after pruning for complete LD)ᵇ | 0.47 | 0.47 |

ᵃ The number of computed values as a percentage of the total number of r values
ᵇ The number of computed values as a percentage of the total number of r values after pruning for complete LD
Fig. 2 Computation time to prune the sequence data using SNPrune and PLINK with various settings
Fig. 3 Distribution of distances between pairs of SNPs pruned from the sequence data that were located on the same chromosome