Francesc Coll, Theodore Gouliouris, Sebastian Bruchmann, Jody Phelan, Kathy E Raven, Taane G Clark, Julian Parkhill, Sharon J Peacock.
Abstract
Genome-wide association studies (GWAS) are increasingly being applied to investigate the genetic basis of bacterial traits. However, approaches to perform power calculations for bacterial GWAS are limited. Here we implemented two alternative approaches to conduct power calculations using existing collections of bacterial genomes. First, a sub-sampling approach was undertaken to reduce the allele frequency and effect size of a known and detectable genotype-phenotype relationship by modifying phenotype labels. Second, a phenotype-simulation approach was conducted to simulate phenotypes from existing genetic variants. We implemented both approaches into a computational pipeline (PowerBacGWAS) that supports power calculations for burden testing, pan-genome and variant GWAS, and applied it to collections of Enterococcus faecium, Klebsiella pneumoniae and Mycobacterium tuberculosis. We used this pipeline to determine the sample sizes required to detect causal variants of different minor allele frequencies (MAF), effect sizes and phenotype heritability, and studied the effect of homoplasy and population diversity on the power to detect causal variants. Our pipeline and user documentation are made available and can be applied to other bacterial populations. PowerBacGWAS can be used to determine the sample sizes required to find statistically significant associations, or the associations detectable with a given sample size. We recommend performing power calculations using existing genomes of the bacterial species and population under study.
Year: 2022 PMID: 35338232 PMCID: PMC8956664 DOI: 10.1038/s42003-022-03194-2
Source DB: PubMed Journal: Commun Biol ISSN: 2399-3642
Fig. 1 Approach to bacterial GWAS power calculations.
Four steps were implemented to conduct power calculations. First, known or randomly sampled causal variants are chosen from existing genotypes, in the sub-sampling or phenotype simulation approach, respectively. In the latter, causal variants meeting a selected range of MAF and degree of homoplasy are chosen. Second, phenotypes are either modified from existing ones (sub-sampling approach) or simulated from randomly selected genotypes (phenotype simulation approach) to achieve the range of chosen sample sizes and effect sizes (or heritability values). Third, a genome-wide association study (GWAS) is conducted for each combination of parameters and the p-value of the causal variant is extracted. Fourth, power is calculated as the proportion of GWAS replicates in which the causal variant is above the Bonferroni-corrected genome-wide significance threshold.
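The fourth step can be illustrated with a minimal sketch, assuming the p-value of the causal variant has already been extracted from each GWAS replicate. The function names, the significance level of 0.05 and the example p-values are illustrative assumptions, not part of PowerBacGWAS. A variant "above" the threshold on the usual -log10 scale is treated here as having a p-value below the Bonferroni-corrected cut-off.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Genome-wide significance threshold after Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def power_from_replicates(causal_p_values, alpha=0.05, n_tests=1):
    """Proportion of replicates in which the causal variant is significant."""
    p_values = list(causal_p_values)
    if not p_values:
        raise ValueError("at least one replicate is required")
    threshold = bonferroni_threshold(alpha, n_tests)
    hits = sum(1 for p in p_values if p < threshold)
    return hits / len(p_values)


# Hypothetical p-values of one causal variant across five GWAS replicates,
# tested alongside a hypothetical 100,000 variants genome-wide.
replicate_p_values = [1e-12, 3e-9, 2e-4, 5e-8, 0.03]
power = power_from_replicates(replicate_p_values, alpha=0.05, n_tests=100_000)
print(f"power = {power:.2f}")  # three of five replicates pass: 0.60
```

In practice many more replicates per parameter combination would be needed for a stable estimate; five are used here only to keep the arithmetic easy to follow.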
Bacterial species, strain collections and antibiotic susceptibility phenotypes used in this study.
| Bacterial species | Strain collection | # of isolates (diversity) | LD: median | # of SNP sites | # of genes in pan-genome | AMR phenotype (% R and S)ᵃ | AMR causal variants | AMR causal variants: AFᵇ, OR and GWAS |
|---|---|---|---|---|---|---|---|---|
| E. faecium | Species-wide | | 0.65 (0.37–0.95) | 263,875 | 11,800 | Kanamycin susceptibility (35.3%, 23.3%) | | AF: 56.3% OR: 1083 |
| | Single-clade | | 0.50 (0.28–0.98) | 50,790 | 5443 | Streptomycin susceptibility (34.5%, 60.3%) | | AF: 34% OR: 8986 |
| K. pneumoniae | Species-wide | | 0.67 (0.37–1.00) | 543,165 | 30,772 | Meropenem susceptibility (21%, 69.1%) | | AF: 12% OR: 180 |
| | Single-clade | | 0.78 (0.50–0.96) | 46,541 | 23,708 | Meropenem susceptibility (95.4%, 1.3%) | | AF: 72% OR: NAᶜ |
| M. tuberculosis | Species-wide | | 0.86 (0.39–1.00) | 93,995 | 21,678 | Isoniazid susceptibility (30.9%, 66.4%) | nsSNPs in | AF: 20% OR: 220 |
| | Single-cladeᵉ | | 0.98 (0.40–1.00) | 24,467 | 10,130 | Isoniazid susceptibility (23.8%, 71.7%) | nsSNPs in | AF: 13% OR: 166 |
Summary table of strain collections used in this study. The average diversity (third column) was calculated as the mean pairwise genetic distance between isolates, expressed as the number of SNPs per kilobase. The number of SNP sites in the chromosome (fifth column; extracted from the VCF file) and the number of genes in the pan-genome (sixth column; extracted from Panaroo's output), both calculated across all isolates, indicate the degree of diversity within each collection. The last columns show the AMR phenotypes and causal variants used by the sub-sampling approach to perform power calculations. The single-clade collections correspond to clade A1 isolates for E. faecium, CC258 isolates for K. pneumoniae, and lineage 4.3 isolates for M. tuberculosis.
ᵃ The percentages of resistant and susceptible isolates may not add up to 100%, as a subset of isolates was not tested.
ᵇ The allele frequency was calculated in the whole population, not just in the samples phenotyped for the antibiotic in question.
ᶜ The unbalanced number of cases and controls prevented running GWAS.
SNPs/kb Single Nucleotide Polymorphisms per kilobase, AF allele frequency, OR odds ratio, nsSNP non-synonymous SNPs, LD linkage disequilibrium.
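The AF and OR values in the last column can be related to the underlying data with a short sketch. It assumes binary genotype and phenotype vectors, with `None` marking isolates that were not phenotyped, so that AF is computed over the whole population and OR only over tested isolates, as the table notes describe. The function names, the toy data and the 0.5 continuity correction applied when a cell of the 2x2 table is empty are assumptions made for illustration, not the procedure used in the study.

```python
def allele_frequency(genotypes):
    """Fraction of all isolates carrying the variant (1 = present, 0 = absent)."""
    if not genotypes:
        raise ValueError("genotypes must not be empty")
    return sum(genotypes) / len(genotypes)


def odds_ratio(genotypes, phenotypes, correction=0.5):
    """Odds ratio of resistance (1) vs susceptibility (0) in carriers vs non-carriers.

    Isolates with phenotype None (not tested) are skipped. If any cell of the
    2x2 table is zero, `correction` is added to every cell (an assumption).
    """
    if len(genotypes) != len(phenotypes):
        raise ValueError("genotypes and phenotypes must have the same length")
    counts = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    for genotype, phenotype in zip(genotypes, phenotypes):
        if phenotype is None:
            continue  # untested isolate: contributes to AF but not to OR
        counts[(genotype, phenotype)] += 1
    if any(value == 0 for value in counts.values()):
        counts = {key: value + correction for key, value in counts.items()}
    return (counts[(1, 1)] * counts[(0, 0)]) / (counts[(1, 0)] * counts[(0, 1)])


# Toy data: 10 isolates, the last one not phenotyped.
genotypes = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
phenotypes = [1, 1, 1, 0, 1, 0, 0, 0, 0, None]
print(allele_frequency(genotypes))        # 4 carriers out of 10 isolates
print(odds_ratio(genotypes, phenotypes))  # (3 * 4) / (1 * 1)
```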
Fig. 2 Power calculations obtained using the sub-sampling approach for the detection of AMR genes.
Results of running GWAS power calculations applying the sub-sampling approach for the detection of known AMR genotype-phenotype relationships (binary phenotype). These plots show the sample sizes required to detect AMR causal genes of different AF and effect sizes (for which full heritability is assumed). The y-axis shows the power, calculated as the proportion of GWAS replicates in which the causal AMR gene is above the Bonferroni-corrected genome-wide significance threshold. The black dotted horizontal line marks 80% power. Sample sizes are shown on the x-axis. Line colours denote different AF, whereas point shapes and line types denote effect sizes in odds ratio units. The power calculation results presented here are those for the species-wide populations; see Supplementary Table 1 for the sample sizes required in both species-wide and single-clade populations. Pan-genome GWAS was run to detect acquired AMR genes in E. faecium (a) and K. pneumoniae (b) populations. A burden test GWAS was applied to M. tuberculosis (c).
Sample sizes required to detect causal genes of different MAF and effect sizes in a pan-genome GWAS.
| Bacterial species | Strain collection | Gene frequency (%) | Small effect (OR 1.5) | Moderate effect (OR 5) | Large effect (OR 10) | Very large effect (OR 100) |
|---|---|---|---|---|---|---|
| E. faecium | Species-wide ( | 1 | – | – | – | – |
| | | 2.5 | – | – | – | 1100 |
| | | 5 | – | 1000 | 600 | 500 |
| | | 10 | – | 500 | 400 | 200 |
| | | 25 | – | 200 | 200 | 100 |
| | Single-clade ( | 0–1 | – | – | – | – |
| | | 2.5 | – | – | 1400 | 1000 |
| | | 5 | – | – | – | – |
| | | 10 | – | 600 | 400 | 300 |
| | | 25 | – | 300 | 200 | 100 |
| K. pneumoniae | Species-wide ( | 1 | – | – | – | – |
| | | 2.5 | – | 2500 | 1600 | 1200 |
| | | 5 | – | 1500 | 1000 | 700 |
| | | 10 | – | 600 | 400 | 300 |
| | | 25 | – | 500 | 400 | 200 |
| | Single-clade ( | 0–1 | – | – | – | – |
| | | 2.5 | – | – | – | 1000 |
| | | 5 | – | 900 | 700 | 500 |
| | | 10 | – | 500 | 300 | 200 |
| | | 25 | – | 300 | 200 | 100 |
| M. tuberculosis | Species-wide ( | 1 | – | – | – | – |
| | | 2.5 | – | 2000 | 1300 | 1000 |
| | | 5 | – | 1100 | 700 | 500 |
| | | 10 | – | – | 900 | 500 |
| | | 25 | – | 300 | 200 | 100 |
| | Single-clade ( | 0–1 | – | – | – | – |
| | | 2.5 | – | – | – | 1000 |
| | | 5 | – | 900 | 700 | 500 |
| | | 10 | – | 500 | 300 | 200 |
| | | 25 | – | 300 | 200 | 100 |
MAF minor allele frequency, OR odds ratio, – not detectable with 80% power.
Results of running GWAS power calculations applying the phenotype simulation approach (binary phenotype, full heritability assumed). This table shows the minimum sample sizes required to detect acquired genes of different effect sizes (in odds ratio units) and gene frequencies in a pan-genome GWAS with 80% power, in both species-wide and single-clade populations.
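To make the idea of simulating a binary phenotype from an existing genotype concrete, the sketch below draws case/control labels so that carriers of a causal gene have a chosen odds ratio of being a case relative to non-carriers. This is a simplified illustration and not the simulation procedure implemented in PowerBacGWAS: it does not model heritability, and the function names, the baseline case probability `baseline_prob` and the toy genotype vector are assumptions.

```python
import random


def carrier_case_probability(odds_ratio, baseline_prob):
    """Case probability in carriers implied by an odds ratio and the
    case probability in non-carriers."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must lie strictly between 0 and 1")
    if odds_ratio <= 0:
        raise ValueError("odds_ratio must be positive")
    carrier_odds = odds_ratio * baseline_prob / (1 - baseline_prob)
    return carrier_odds / (1 + carrier_odds)


def simulate_binary_phenotype(genotypes, odds_ratio, baseline_prob=0.2, seed=None):
    """Draw a 0/1 phenotype per isolate given its 0/1 genotype."""
    rng = random.Random(seed)
    p_carrier = carrier_case_probability(odds_ratio, baseline_prob)
    return [
        1 if rng.random() < (p_carrier if genotype else baseline_prob) else 0
        for genotype in genotypes
    ]


# Toy genotype: a gene present in 10% of 1000 isolates, odds ratio of 10.
toy_genotypes = [1] * 100 + [0] * 900
toy_phenotypes = simulate_binary_phenotype(toy_genotypes, odds_ratio=10, seed=42)
print(sum(toy_phenotypes), "cases out of", len(toy_phenotypes))
```

With a baseline case probability of 0.2 and an odds ratio of 10, carriers become cases with probability 2.5/3.5, or about 0.71. Repeating the draw with different seeds and sample sizes, and running a GWAS on each draw, would yield the replicates from which power is estimated.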
Sample sizes required to detect SNPs and mutated genes of different MAF and effect sizes.
| Bacterial species | Strain collection | MAF (%) | Variant GWAS: small (OR 1.5) | Variant GWAS: moderate (OR 5) | Variant GWAS: large (OR 10) | Variant GWAS: very large (OR 100) | Burden GWAS: small (OR 1.5) | Burden GWAS: moderate (OR 5) | Burden GWAS: large (OR 10) | Burden GWAS: very large (OR 100) |
|---|---|---|---|---|---|---|---|---|---|---|
| E. faecium | Species-wide ( | 1 | – | – | – | – | – | – | – | – |
| | | 2.5 | – | – | – | – | – | – | – | 1000 |
| | | 5 | – | – | – | – | – | – | – | – |
| | | 10 | – | – | 1200 | 700 | – | 900 | 700 | 400 |
| | | 25 | – | 1200 | 500 | 400 | – | 900 | 400 | 200 |
| | Single-clade ( | 1 | – | – | – | – | – | – | – | – |
| | | 2.5 | – | – | – | – | – | – | 1300 | 900 |
| | | 5 | – | – | – | – | – | – | 900 | 700 |
| | | 10 | – | 1200 | 800 | 500 | – | 1100 | 700 | 400 |
| | | 25 | – | 1100 | 700 | 400 | – | 600 | 300 | 200 |
| K. pneumoniae | Species-wide ( | 1 | – | – | – | – | – | – | – | – |
| | | 2.5 | – | – | – | 2000 | – | 2000 | 1300 | 1000 |
| | | 5 | – | 2000 | 1200 | 800 | – | 1000 | 700 | 500 |
| | | 10 | – | 800 | 600 | 500 | – | 700 | 400 | 300 |
| | | 25 | – | 300 | 200 | 100 | – | 400 | 200 | 200 |
| | Single-clade ( | 1 | – | – | – | – | – | – | – | – |
| | | 2.5 | – | – | – | – | – | – | – | 1100 |
| | | 5 | – | – | – | – | – | – | – | 800 |
| | | 10 | – | – | – | – | – | – | – | – |
| | | 25 | – | 900 | 600 | 300 | – | – | 700 | 400 |
| M. tuberculosis | Species-wide ( | 1 | – | – | – | – | – | – | – | – |
| | | 2.5 | – | – | – | – | – | 2000 | 1400 | 1000 |
| | | 5 | – | – | – | – | – | 1400 | 1000 | 700 |
| | | 10 | – | – | – | – | – | 1500 | 1200 | 600 |
| | | 25 | – | 1300 | 800 | 500 | – | 900 | 500 | 200 |
| | Single-clade ( | 0–1 | – | – | – | – | – | – | – | – |
| | | 2.5 | – | – | – | – | – | – | – | 1000 |
| | | 5 | – | – | – | – | – | – | 800 | 600 |
| | | 10 | – | 900 | 600 | 500 | – | – | 900 | 500 |
| | | 25 | NA | 400 | 200 | 100 | – | 600 | 300 | 200 |
MAF minor allele frequency, OR odds ratio, NA no variants available with that MAF, – not detectable with 80% power.
Results of running GWAS power calculations applying the phenotype simulation approach (binary phenotype, full heritability assumed). This table shows the minimum sample sizes required to detect acquired variants (i.e. mutations in the bacterial chromosome) of different effect sizes (in odds ratio units) and MAF using a variant or burden test GWAS with 80% power, in both species-wide and single-clade populations. Supplementary Figs. 2 and 3 show the PowerBacGWAS plots from which the results in this table were extracted.
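Reading a minimum sample size off a power curve can be sketched as follows. The sketch assumes that power has already been estimated at a grid of sample sizes and takes the smallest sample size at which power reaches the target; the function name and the example curve are hypothetical, and a curve that never reaches the target corresponds to a "–" entry in the table.

```python
def minimum_sample_size(power_by_sample_size, target_power=0.8):
    """Smallest tested sample size whose estimated power reaches the target.

    Returns None when no tested sample size reaches the target power.
    """
    for sample_size in sorted(power_by_sample_size):
        if power_by_sample_size[sample_size] >= target_power:
            return sample_size
    return None


# Hypothetical power estimates for one MAF and effect-size combination.
power_curve = {200: 0.10, 400: 0.45, 600: 0.82, 800: 0.95}
print(minimum_sample_size(power_curve))  # 600

# A curve that never reaches 80% power yields None.
print(minimum_sample_size({200: 0.05, 400: 0.20}))  # None
```

Because power estimates from a limited number of replicates can fluctuate, taking the first crossing is a simplification; a stricter rule could require power to stay above the target at all larger sample sizes.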
Fig. 3 Effect of degree of homoplasy on the power to detect SNPs obtained using the phenotype-simulation approach.
These plots show the sample sizes required to detect causal SNPs of different effect sizes (in odds ratio units, shown as different colours) and degrees of homoplasy (number of independent acquisitions, shown as different point shapes) when simulating binary phenotypes (full heritability assumed). The power calculation results presented here are those for SNPs of 10% MAF, in both species-wide (panels a, c, e) and single-clade populations (panels b, d, f); see Supplementary Table 4 for SNPs of different MAF. The power estimates in Fig. 3e, i.e. for SNPs with 50–100 homoplasy steps in the M. tuberculosis population, are particularly noisy because few SNPs in this population arise 50–100 times in the phylogeny (only 9 variants), and power estimates from such a small sample fluctuate.
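One common way to quantify the degree of homoplasy of a variant is to count the minimum number of state changes it requires on the phylogeny. The sketch below does this with Fitch parsimony on a rooted binary tree written as nested 2-tuples of leaf names. It is offered as an illustration under these assumptions and is not necessarily how homoplasy was measured in the study; the tree, the allele states and the function name are hypothetical. Note that parsimony counts gains and losses alike, so it only approximates the number of independent acquisitions.

```python
def fitch_changes(tree, states):
    """Return (possible_root_states, minimum_changes) for a binary trait.

    `tree` is a leaf name (str) or a 2-tuple of subtrees;
    `states` maps each leaf name to its allele (0 or 1).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_changes(left, states)
    right_set, right_changes = fitch_changes(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed here.
        return shared, left_changes + right_changes
    # Children disagree: one change must occur on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical tree of eight isolates; the allele is present in A, B and E.
tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))
states = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 1, "F": 0, "G": 0, "H": 0}
_, n_changes = fitch_changes(tree, states)
print(n_changes)  # 2: once in the ancestor of A and B, once in E
```

A variant confined to a single clade (for example present only in A and B) needs one change on this tree, whereas a highly homoplasic variant needs many.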