Apostolos Dimitromanolakis, Jingxiong Xu, Agnieszka Krol, Laurent Briollais.
Abstract
BACKGROUND: Simulation of genetic variant data is frequently required to evaluate statistical methods in human and animal genetics. Although a number of high-quality genetic simulators have been developed, many of them require advanced knowledge of population genetics or of computation to be used effectively. In addition, generating simulated data for family-based studies demands sophisticated methods and advanced computer programming.
Keywords: 1000 genomes; Linkage disequilibrium; NGS; Pedigree data; Sequencing; Simulation
Year: 2019 PMID: 30646839 PMCID: PMC6332552 DOI: 10.1186/s12859-019-2611-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1. (a) Overview of our simulation workflow (function names in parentheses). (b) Generating related individuals in sim1000G by following recombination events
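The idea named in panel b, deriving a relative's haplotype from two parental haplotypes by following recombination events, can be illustrated with a small sketch. This is not sim1000G's implementation (sim1000G is an R package). The function name `make_gamete`, the 0/1 allele coding, and the explicit list of crossover positions are assumptions made for illustration; in an actual simulation the crossover positions would typically be drawn at random from a genetic map.

```python
def make_gamete(hap_a, hap_b, positions, crossovers):
    """Copy alleles from one parental haplotype, switching at each crossover.

    hap_a, hap_b : 0/1 alleles of the two parental haplotypes (hypothetical coding)
    positions    : variant positions in ascending order
    crossovers   : positions of recombination events (assumed given, not simulated)
    """
    gamete = []
    source = 0  # 0 -> copy from hap_a, 1 -> copy from hap_b
    pending = sorted(crossovers)
    k = 0
    for pos, allele_a, allele_b in zip(positions, hap_a, hap_b):
        # Switch the source once for every crossover located before this variant.
        while k < len(pending) and pending[k] < pos:
            source = 1 - source
            k += 1
        gamete.append(allele_a if source == 0 else allele_b)
    return gamete


# One crossover between the second and third variant.
child_haplotype = make_gamete([0, 0, 0, 0], [1, 1, 1, 1], [10, 20, 30, 40], [25])
```

Passing the crossover positions explicitly keeps the sketch deterministic, which makes it easy to check by hand before replacing them with random draws.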
Fig. 2. Running time of sim1000G when simulating a specific number of individuals and number of variants (timings include the simulation initialization time). The simulated region length does not affect the simulation time
Fig. 3. Comparison of simulated genetic variants to their original population. (a) Allele frequency comparison between the original genetic variants and the simulated ones. (b) Decay of LD patterns for the original data and the 3 simulators tested. Each curve shows the average value of pairwise LD (r²) between genetic variants with respect to the distance between these variants
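The LD decay curve described for panel b averages pairwise r² over variant pairs grouped by their distance. A minimal sketch of that calculation follows, assuming phased haplotypes coded 0/1 and fixed-width distance bins; the function names, the binning scheme, and the toy data are illustrative assumptions and are not taken from the paper.

```python
from itertools import combinations


def r_squared(hap_a, hap_b):
    """r^2 between two biallelic sites, each given as 0/1 alleles per haplotype."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None  # a monomorphic site has no defined r^2
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    return d * d / denom


def ld_decay(positions, haplotypes, bin_width):
    """Average r^2 per distance bin; haplotypes[i] is the allele vector of variant i."""
    sums, counts = {}, {}
    for i, j in combinations(range(len(positions)), 2):
        r2 = r_squared(haplotypes[i], haplotypes[j])
        if r2 is None:
            continue
        b = abs(positions[j] - positions[i]) // bin_width
        sums[b] = sums.get(b, 0.0) + r2
        counts[b] = counts.get(b, 0) + 1
    # Key each bin by its lower distance bound.
    return {b * bin_width: sums[b] / counts[b] for b in sorted(sums)}


# Toy data: variants 0 and 1 are identical, variant 2 is uncorrelated with both.
curve = ld_decay([100, 200, 1200],
                 [[0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 0, 1]],
                 bin_width=1000)
```

Computing the same curve on the original and on the simulated haplotypes gives two dictionaries that can be compared bin by bin.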
Power (α=0.05) for the SKAT test under the population stratification scenario and varying levels of stratification. 10 causal variants were selected in causal genes. n2: number of individuals of African descent out of 2000 individuals
Fig. 4. Examples of 2 Q-Q plots of SKAT p-values under two simulation scenarios: (a) Significant population stratification and no covariate adjustment. (b) With covariate adjustment. Solid points: true causal genes. Total number of genes was 200 with 3 causal genes, each with 10 causal variants
Proportion of variants within each MAF range category. The MAF range we specified when simulating the data was [0.0005,0.01]
| | N_cases = N_controls = 250 | | | N_cases = N_controls = 1000 | | |
|---|---|---|---|---|---|---|
| Simulator | [0,0.0005) | [0.0005,0.01] | (0.01,0.5) | [0,0.0005) | [0.0005,0.01] | (0.01,0.5) |
| sim1000G | 9.00% | 89.10% | 1.80% | 0.10% | 98.60% | 1.30% |
| sim1000G | 14.30% | 85.10% | 0.60% | 0.70% | 99.10% | 0.20% |
| sim1000G | 14.40% | 84.00% | 1.60% | 0.70% | 98.50% | 0.80% |
| simuPOP | 33.20% | 65.30% | 1.50% | 14.60% | 84.30% | 1.10% |
| simuPOP | 27.90% | 70.00% | 2.10% | 11.40% | 86.70% | 1.90% |
| simuPOP | 28.20% | 69.20% | 2.70% | 11.70% | 86.80% | 1.60% |
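The proportions above classify each simulated variant by its realized minor allele frequency (MAF) relative to the requested range [0.0005, 0.01]. A minimal sketch of such a tabulation is given below, assuming genotypes coded as 0/1/2 copies of one allele per individual; the function names and the toy genotypes are hypothetical and do not come from sim1000G or simuPOP.

```python
def minor_allele_frequency(genotypes):
    """MAF of one variant from per-individual allele counts (0, 1 or 2)."""
    p = sum(genotypes) / (2 * len(genotypes))
    return min(p, 1 - p)


def maf_bin_proportions(variant_genotypes, low=0.0005, high=0.01):
    """Share of variants with MAF below, within, or above the range [low, high]."""
    counts = {"below": 0, "within": 0, "above": 0}
    for genotypes in variant_genotypes:
        maf = minor_allele_frequency(genotypes)
        if maf < low:
            counts["below"] += 1
        elif maf <= high:
            counts["within"] += 1
        else:
            counts["above"] += 1
    total = len(variant_genotypes)
    return {name: count / total for name, count in counts.items()}


# Toy sample of 1000 individuals and 4 variants.
variants = [
    [0] * 1000,              # monomorphic: MAF 0
    [1] + [0] * 999,         # one allele copy in 2000: MAF 0.0005
    [1] * 100 + [0] * 900,   # MAF 0.05
    [2] * 1000,              # fixed for the counted allele: MAF 0
]
proportions = maf_bin_proportions(variants)
```

With 1000 individuals a single allele copy already corresponds to a MAF of 0.0005, so in smaller samples a rare variant can easily be absent from the sample and land in the lowest category.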
Statistical power comparison
| | N_cases = N_controls = 250 | | | N_cases = N_controls = 1000 | | |
|---|---|---|---|---|---|---|
| Simulator | SKAT | Burden | SKAT-O | SKAT | Burden | SKAT-O |
| sim1000G | 21.00% | 19.00% | 27.00% | 70.00% | 53.00% | 76.00% |
| sim1000G | 19.00% | 32.00% | 28.00% | 71.00% | 77.00% | 82.00% |
| sim1000G | 47.00% | 81.00% | 82.00% | 98.00% | 99.00% | 100.00% |
| simuPOP | 26.20% | 20.90% | 28.80% | 87.90% | 59.90% | 88.20% |
| simuPOP | 28.70% | 20.80% | 30.50% | 92.40% | 56.30% | 91.70% |
| simuPOP | 66.80% | 69.90% | 78.40% | 100.00% | 99.70% | 100.00% |
Estimated type I error and power over 500 simulations for the association test
| s | 3 | 3 | 3 |
|---|---|---|---|
| | 5.6 | 4.2 | 5.0 |
| | 35.7 | 53.2 | 60.0 |
| | 96.8 | 99.4 | 100.0 |
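Type I error and power estimated over repeated simulations are usually computed as the fraction of simulated datasets in which the test rejects at the chosen level. The sketch below shows that calculation, assuming one p-value per simulated dataset and α = 0.05; the function name and the example p-values are hypothetical and are not the values behind the table.

```python
def rejection_rate(p_values, alpha=0.05):
    """Fraction of simulations with p < alpha.

    Applied to data simulated under the null hypothesis this estimates the
    type I error; applied to data simulated with a true effect it estimates power.
    """
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Hypothetical p-values from four simulated datasets.
rate = rejection_rate([0.01, 0.2, 0.04, 0.5])
```

With 500 simulations the estimate is a binomial proportion, so its precision can be judged from the usual binomial standard error.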