Guimei Liu, Yue Wang, Limsoon Wong.
Abstract
BACKGROUND: The human genome contains millions of common single nucleotide polymorphisms (SNPs), and these SNPs play an important role in understanding the association between genetic variations and human diseases. Many SNPs show correlated genotypes, or linkage disequilibrium (LD), so it is not necessary to genotype all SNPs for an association study. Many algorithms have been developed to find a small subset of SNPs, called tag SNPs, that are sufficient to infer all the other SNPs. Algorithms based on the r² LD statistic have gained popularity because r² is directly related to the statistical power to detect disease associations. Most existing r²-based algorithms use pairwise LD. Recent studies show that multi-marker LD can further reduce the number of tag SNPs. However, existing tag SNP selection algorithms based on multi-marker LD are both time-consuming and memory-consuming: they cannot work on chromosomes containing more than 100 k SNPs using length-3 tagging rules.
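The pairwise r² statistic that the abstract refers to can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes phased haplotypes coded 0/1 for each biallelic SNP and uses the standard definition r² = D² / (p_A(1 − p_A) p_B(1 − p_B)), with D = p_AB − p_A p_B. The function name `pairwise_r2` is hypothetical.

```python
from typing import Sequence


def pairwise_r2(a: Sequence[int], b: Sequence[int]) -> float:
    """Pairwise r^2 between two biallelic SNPs.

    `a` and `b` hold one allele (0 or 1) per haplotype, in the same
    haplotype order. This input layout is an assumption of the sketch.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    n = len(a)
    p_a = sum(a) / n                      # frequency of allele 1 at SNP A
    p_b = sum(b) / n                      # frequency of allele 1 at SNP B
    p_ab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0                        # a monomorphic SNP: r^2 is undefined, report 0
    d = p_ab - p_a * p_b                  # linkage disequilibrium coefficient D
    return d * d / denom
```

Under a pairwise tagging scheme, a SNP with r² at or above the chosen threshold (min_r2 in the tables of this record) against a selected tag SNP would not need to be genotyped itself.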
Year: 2010 PMID: 20113476 PMCID: PMC3098109 DOI: 10.1186/1471-2105-11-66
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Datasets.
| datasets | #SNPs | #Rep SNPs | datasets | #SNPs | #Rep SNPs |
|---|---|---|---|---|---|
| ENCODE CEU | 7,221 | 2,484 | chr2 | 169,905 | 85,807 |
| ENCODE HCB | 6,430 | 2,286 | chr3 | 135,058 | 71,244 |
| ENCODE JPT | 6,216 | 2,196 | chr19 | 28,931 | 17,807 |
| ENCODE YRI | 7,963 | 4,408 | chr21 | 28,914 | 15,644 |
| chr1 | 149,716 | 78,893 | chr22 | 26,595 | 15,553 |
The "#Rep SNPs" column is the number of representative SNPs with merging window size of 100 k.
Comparison of running time and number of tag SNPs selected when pairwise LD is used.
| dataset | min_r2 | FastTagger time (min) | LRTag time (min) | MultiTag time (min) | FastTagger #tag SNPs | LRTag #tag SNPs | MultiTag #tag SNPs |
|---|---|---|---|---|---|---|---|
| ENCODE CEU | 0.95 | 0.003 | 0.016 | 10.4 | 2144 | 2127 | 2136 |
| ENCODE HCB | 0.95 | 0.003 | 0.014 | 7.5 | 2065 | 2055 | 2061 |
| ENCODE JPT | 0.95 | 0.003 | 0.013 | 6.6 | 1996 | 1990 | 1996 |
| ENCODE YRI | 0.95 | 0.004 | 0.008 | 41.6 | 4115 | 4107 | 4109 |
| chr1 | 0.95 | 0.076 | 0.242 | 26.2 | 62190 | 61988 | 63391 |
| chr2 | 0.95 | 0.088 | 0.293 | 30.2 | 66026 | 65822 | 67236 |
| chr3 | 0.95 | 0.070 | 0.222 | 25.1 | 55895 | 55713 | 56972 |
| chr19 | 0.95 | 0.015 | 0.032 | 3.6 | 14777 | 14744 | 15014 |
| chr21 | 0.95 | 0.015 | 0.040 | 6.0 | 12455 | 12435 | 12658 |
| chr22 | 0.95 | 0.014 | 0.033 | 7.9 | 12690 | 12652 | 12932 |
The running time of LRTag includes only tag SNP selection time, while the running time of FastTagger and MultiTag includes both rule generation time and tag SNP selection time. MMTagger is excluded from this table because the MMTagger program provided by its authors cannot use pairwise LD to find tag SNPs.
Comparison of running time and number of tag SNPs selected when multi-marker LD is used.
| dataset | max_size | min_r2 | Fast-COOC time (min) | MMTagger time (min) | Fast-1vsR time (min) | MultiTag time | Fast-COOC #tag SNPs | MMTagger #tag SNPs | Fast-1vsR #tag SNPs | MultiTag #tag SNPs |
|---|---|---|---|---|---|---|---|---|---|---|
| ENCODE CEU | 2 | 0.95 | 0.038 | 0.041 | 0.048 | ≥10 hours | 1282 | 1282 | 1291 | 1371 |
| ENCODE HCB | 2 | 0.95 | 0.032 | 0.032 | 0.042 | ≥10 hours | 1305 | 1328 | 1308 | 1424 |
| ENCODE JPT | 2 | 0.95 | 0.029 | 0.028 | 0.038 | ≥10 hours | 1234 | 1258 | 1240 | 1349 |
| ENCODE YRI | 2 | 0.95 | 0.181 | 0.188 | 0.245 | ≥60 hours | 2575 | 2618 | 2579 | 2770 |
| chr1 | 2 | 0.95 | 1.13 | 5.84 | 1.40 | ≥7 days | 43202 | 43483 | 43306 | 43462 |
| chr2 | 2 | 0.95 | 1.32 | 7.21 | 1.63 | ≥7 days | 44135 | 44556 | 44225 | 49289 |
| chr3 | 2 | 0.95 | 1.14 | 5.11 | 1.41 | ≥7 days | 37881 | 38206 | 37952 | 39300 |
| chr19 | 2 | 0.95 | 0.176 | 0.343 | 0.218 | ≥30 hours | 11151 | 11192 | 11160 | 11747 |
| chr21 | 2 | 0.95 | 0.287 | 0.473 | 0.359 | ≥60 hours | 8543 | 8627 | 8564 | 9103 |
| chr22 | 2 | 0.95 | 0.370 | 0.567 | 0.468 | ≥100 hours | 8970 | 9025 | 8993 | 9533 |
| ENCODE CEU | 3 | 0.95 | 1.28 | 3.69 | 1.85 | ≥50 hours | 972 | 1017 | 1151 | 1244 |
| ENCODE HCB | 3 | 0.95 | 1.26 | 3.40 | 1.93 | ≥80 hours | 1003 | 1034 | 1170 | 1170 |
| ENCODE JPT | 3 | 0.95 | 1.06 | 2.74 | 1.60 | ≥50 hours | 958 | 1002 | 1129 | 1244 |
| ENCODE YRI | 3 | 0.95 | 11.6 | 36.7 | 17.4 | ≥14 days | 1848 | 1927 | 2165 | 2516 |
| chr1 | 3 | 0.95 | 34.9 | 137.3 | 49.6 | - | 35556 | 38185 | 40534 | - |
| chr2 | 3 | 0.95 | 42.9 | 166.9 | 60.8 | - | 35502 | 38372 | 41129 | - |
| chr3 | 3 | 0.95 | 39.3 | 154.6 | 55.5 | - | 30695 | 33041 | 35305 | - |
| chr19 | 3 | 0.95 | 4.34 | 16.6 | 6.25 | - | 9444 | 10032 | 10546 | - |
| chr21 | 3 | 0.95 | 9.91 | 37.7 | 14.4 | - | 6929 | 7404 | 7935 | - |
| chr22 | 3 | 0.95 | 16.5 | 65.3 | 24.4 | - | 7327 | 7788 | 8392 | - |
Fast-COOC represents the FastTagger algorithm using the co-occurrence model, and Fast-1vsR represents the FastTagger algorithm using the one-vs-the-rest model. max_size is the maximum number of SNPs on the left-hand side of a tagging rule. For the MMTagger algorithm, we divided chr1, chr2 and chr3 into 10 chunks when max_size = 3, ran MMTagger on each chunk, and then combined the results. For the MultiTag algorithm, when max_size = 3 we divided chr1, chr2 and chr3 into 20 chunks, and chr19, chr21 and chr22 into 5 chunks. Even so, MultiTag took too long to finish on the 6 chromosomes when max_size = 3, so no MultiTag results are reported for them.
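The chunked runs described in the note above can be pictured with a short sketch. The record does not say how the per-chunk results were combined, so taking the union of the per-chunk tag SNP sets is an assumption here. `split_into_chunks`, `select_tags_chunked` and the `select_tags` callback are hypothetical names standing in for whichever tagger is run on each chunk.

```python
from typing import Callable, List, Sequence, Set


def split_into_chunks(snp_ids: Sequence[str], n_chunks: int) -> List[List[str]]:
    """Split SNPs (assumed ordered by position) into n contiguous chunks of near-equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    size, extra = divmod(len(snp_ids), n_chunks)
    chunks: List[List[str]] = []
    start = 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)  # first `extra` chunks get one more SNP
        chunks.append(list(snp_ids[start:end]))
        start = end
    return chunks


def select_tags_chunked(
    snp_ids: Sequence[str],
    n_chunks: int,
    select_tags: Callable[[List[str]], Set[str]],
) -> Set[str]:
    """Run a tagger on each chunk independently and combine the results.

    Combining by set union is an assumption of this sketch.
    """
    tags: Set[str] = set()
    for chunk in split_into_chunks(snp_ids, n_chunks):
        if chunk:                          # skip empty chunks when n_chunks > number of SNPs
            tags |= select_tags(chunk)
    return tags
```

Running each chunk independently means no tagging rule can span a chunk boundary, which is one plausible reason a chunked run could select slightly more tag SNPs than an unchunked one.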
Memory usage of FastTagger and MMTagger.
| dataset | FastTagger | MMTagger | dataset | FastTagger | MMTagger |
|---|---|---|---|---|---|
| chr1 | 94.41 MB | - | chr19 | 30.29 MB | 657 MB |
| chr2 | 287.50 MB | - | chr21 | 74.99 MB | 1210 MB |
| chr3 | 119.72 MB | - | chr22 | 50.20 MB | 1216 MB |
The co-occurrence model is used in FastTagger. min_r2 = 0.95, max_size = 3.
The number of tagging rules generated under the two models using the FastTagger algorithm (min_r2 = 0.9).
| dataset | max_size | #rules (Fast-COOC) | #rules (Fast-1vsR) | memory (Fast-COOC) | memory (Fast-1vsR) |
|---|---|---|---|---|---|
| chr19 | 2 | 121,122 | 120,627 | 6.63 MB | 6.63 MB |
| chr21 | 2 | 169,864 | 168,936 | 11.43 MB | 11.43 MB |
| chr22 | 2 | 156,134 | 155,223 | 8.14 MB | 8.13 MB |
| chr19 | 3 | 1,421,519 | 377,773 | 38.69 MB | 13.29 MB |
| chr21 | 3 | 2,713,338 | 657,767 | 101.11 MB | 29.92 MB |
| chr22 | 3 | 2,590,826 | 573,738 | 67.28 MB | 19.21 MB |
Baseline algorithm: merging equivalent SNPs and pruning redundant rules, no skipping rules.
| dataset | time (minutes) | #tag SNPs | mem | #rules |
|---|---|---|---|---|
| chr19 | 4.34 | 9444 | 30.29 MB | |
| chr21 | 9.91 | 6929 | 74.99 MB | |
| chr22 | 16.5 | 7327 | 50.20 MB |
The co-occurrence model is used. max_size = 3, min_r2 = 0.95.
Baseline algorithm WITHOUT merging equivalent SNPs.
| dataset | time (minutes) | #tag SNPs | mem | #rules |
|---|---|---|---|---|
| chr19 | 31.4 | 9476 | 209.83 MB | |
| chr21 | 72.3 | 6959 | 555.42 MB | |
| chr22 | 90.5 | 7342 | 340.59 MB |
The co-occurrence model is used. max_size = 3, min_r2 = 0.95.
Baseline algorithm WITHOUT pruning redundant rules.
| dataset | time (minutes) | #tag SNPs | mem | #rules |
|---|---|---|---|---|
| chr19 | 4.24 | 9439 | 75.70 MB | |
| chr21 | 9.60 | 6942 | 191.86 MB | |
| chr22 | 15.8 | 7327 | 130.19 MB |
The co-occurrence model is used. max_size = 3, min_r2 = 0.95.
Baseline algorithm with skipping rules: once a SNP has appeared on the right-hand side of at least 5 rules, it is no longer considered as a right-hand side.
| dataset | time (minutes) | #tag SNPs | mem | #rules |
|---|---|---|---|---|
| chr19 | 3.66 | 9550 | 18.61 MB | |
| chr21 | 8.06 | 7086 | 40.74 MB | |
| chr22 | 13.5 | 7447 | 28.62 MB |
The co-occurrence model is used. max_size = 3, min_r2 = 0.95.
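The skipping heuristic in the caption above can be sketched as a cap on how many rules may share a right-hand-side SNP. This is only an illustration of the stated threshold of 5. The rule representation (a tuple of left-hand-side SNP ids plus one right-hand-side SNP id) and the name `keep_rules_with_skipping` are assumptions, and the sketch filters an already generated candidate stream, whereas a real implementation would presumably skip such SNPs during rule generation to save time and memory.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# A tagging rule: (left-hand-side SNPs, right-hand-side SNP). Hypothetical layout.
Rule = Tuple[Tuple[str, ...], str]


def keep_rules_with_skipping(candidates: Iterable[Rule], max_rhs: int = 5) -> List[Rule]:
    """Keep candidate rules until a SNP has been a right-hand side max_rhs times.

    Any later rule with that SNP on the right-hand side is skipped.
    """
    rhs_count: Counter = Counter()
    kept: List[Rule] = []
    for lhs, rhs in candidates:
        if rhs_count[rhs] >= max_rhs:
            continue                      # this SNP is already covered by enough rules
        kept.append((lhs, rhs))
        rhs_count[rhs] += 1
    return kept
```

The table above is consistent with the expected trade-off: fewer stored rules reduce memory and running time, at the cost of a somewhat larger tag SNP set.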
Performance of Fast-COOC when memory size is restricted to 50 MB (max_size = 3, min_r2 = 0.95)
| dataset | time in minutes (no memory constraint) | #tag SNPs (no memory constraint) | mem (no memory constraint) | time in minutes (mem = 50 MB) | #tag SNPs (mem = 50 MB) | #chunks (mem = 50 MB) |
|---|---|---|---|---|---|---|
| chr1 | 34.9 | 35556 | 94.41 MB | 35.14 | 35561 | 16 |
| chr2 | 42.9 | 35502 | 287.50 MB | 43.14 | 35518 | 21 |
| chr3 | 39.3 | 30695 | 119.72 MB | 39.3 | 30706 | 15 |
Figure 1. Portability of length-1 rules. The rules are generated from the Han Chinese population with min_r2 = 0.9, and they are then validated on the other two datasets as well.
Figure 2. Portability of length-2 rules. The rules are generated from the Han Chinese population with min_r2 = 0.9.
Figure 3. Portability of length-3 rules. The rules are generated from the Han Chinese population with min_r2 = 0.9.
Average r² and prediction accuracy of rules of different length on three populations.
| len | model | #rules (HCB) | #rules (JPT) | #rules (CEU) | average r² (HCB) | average r² (JPT) | average r² (CEU) | average accuracy (HCB) | average accuracy (JPT) | average accuracy (CEU) |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | pairwise | 85961 | 84123 | 69083 | 0.978 | 0.942 | 0.865 | 0.995 | 0.989 | 0.966 |
| 2 | co-occurrence | 1563176 | 1472654 | 1014934 | 0.965 | 0.878 | 0.745 | 0.993 | 0.977 | 0.938 |
| 2 | one-vs-the-rest | 1560181 | 1469765 | 1012699 | 0.965 | 0.881 | 0.753 | 0.993 | 0.977 | 0.940 |
| 3 | co-occurrence | 26182522 | 24495802 | 16064120 | 0.952 | 0.790 | 0.665 | 0.990 | 0.960 | 0.913 |
| 3 | one-vs-the-rest | 7074493 | 6269985 | 3955224 | 0.970 | 0.791 | 0.659 | 0.994 | 0.970 | 0.919 |
The rules are generated from the Han Chinese population with min_r2 = 0.9. Some rules may become invalid in the other two populations because the MAF of some SNPs in those populations may be smaller than 5%. When only pairwise LD is used, all algorithms generate the same set of rules. When multi-markers are considered, FastTagger-COOC and MMTagger generate the same set of rules using the co-occurrence model; FastTagger-1vsR and MultiTag generate the same set of rules using the one-vs-the-rest model.
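The validity check mentioned in the note above (a rule becomes invalid in another population when one of its SNPs has MAF below 5%) can be sketched as follows. The data layout (a dict from SNP id to a 0/1 haplotype vector for the target population), the rule representation, and the function names are assumptions made for illustration, not the authors' code.

```python
from typing import Dict, Iterable, List, Sequence, Tuple

# A tagging rule: (left-hand-side SNPs, right-hand-side SNP). Hypothetical layout.
Rule = Tuple[Tuple[str, ...], str]


def minor_allele_frequency(haplotypes: Sequence[int]) -> float:
    """MAF of a biallelic SNP from alleles coded 0/1, one per haplotype."""
    freq = sum(haplotypes) / len(haplotypes)
    return min(freq, 1.0 - freq)


def rules_valid_in_population(
    rules: Iterable[Rule],
    population: Dict[str, Sequence[int]],
    min_maf: float = 0.05,
) -> List[Rule]:
    """Keep rules whose SNPs are all typed in the population with MAF >= min_maf."""
    valid: List[Rule] = []
    for lhs, rhs in rules:
        snps = (*lhs, rhs)
        if all(
            s in population and minor_allele_frequency(population[s]) >= min_maf
            for s in snps
        ):
            valid.append((lhs, rhs))
    return valid
```

Only the rules that survive this filter would then be scored for r² and prediction accuracy in the target population.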
Figure 4. Prediction accuracy of length-1 rules. The rules are generated from the Han Chinese population with min_r2 = 0.9.
Figure 5. Prediction accuracy of length-2 rules. The rules are generated from the Han Chinese population with min_r2 = 0.9.
Figure 6. Prediction accuracy of length-3 rules. The rules are generated from the Han Chinese population with min_r2 = 0.9.