Brendan Epstein, Reda A I Abou-Shanab, Abdelaal Shamseldin, Margaret R Taylor, Joseph Guhlin, Liana T Burghardt, Matthew Nelson, Michael J Sadowsky, Peter Tiffin.
Abstract
Genome-wide association studies (GWAS) can identify genetic variants responsible for naturally occurring and quantitative phenotypic variation. Association studies therefore provide a powerful complement to approaches that rely on de novo mutations for characterizing gene function. Although bacteria should be amenable to GWAS, few GWAS have been conducted on bacteria, and the extent to which nonindependence among genomic variants (e.g., linkage disequilibrium [LD]) and the genetic architecture of phenotypic traits will affect GWAS performance is unclear. We apply association analyses to identify candidate genes underlying variation in 20 biochemical, growth, and symbiotic phenotypes among 153 strains of Ensifer meliloti. For 11 traits, we find genotype-phenotype associations that are stronger than expected by chance, with the candidates in relatively small linkage groups, indicating that LD does not preclude resolving association candidates to relatively small genomic regions. The significant candidates show an enrichment for nucleotide polymorphisms (SNPs) over gene presence-absence variation (PAV), and for five traits, candidates are enriched in large linkage groups, a possible signature of epistasis. Many of the variants most strongly associated with symbiosis phenotypes were in genes previously identified as being involved in nitrogen fixation or nodulation. For other traits, apparently strong associations were not stronger than the range of associations detected in permuted data. In sum, our data show that GWAS in bacteria may be a powerful tool for characterizing genetic architecture and identifying genes responsible for phenotypic variation. However, careful evaluation of candidates is necessary to avoid false signals of association.
IMPORTANCE Genome-wide association analyses are a powerful approach for identifying gene function. These analyses are becoming commonplace in studies of humans, domesticated animals, and crop plants but have rarely been conducted in bacteria. We applied association analyses to 20 traits measured in Ensifer meliloti, an agriculturally and ecologically important bacterium because it fixes nitrogen when in symbiosis with leguminous plants. We identified candidate alleles and gene presence-absence variants underlying variation in symbiosis traits, antibiotic resistance, and use of various carbon sources; some of these candidates are in genes previously known to affect these traits whereas others were in genes that have not been well characterized. Our results point to the potential power of association analyses in bacteria, but also to the need to carefully evaluate the potential for false associations.
Keywords: BSLMM; GWAS; Medicago; Sinorhizobium; bacteria; chip heritability; genetic architecture; genomics; linkage disequilibrium; rhizobium; symbiosis
Year: 2018 PMID: 30355664 PMCID: PMC6200981 DOI: 10.1128/mSphere.00386-18
Source DB: PubMed Journal: mSphere ISSN: 2379-5042 Impact factor: 4.389
FIG 1 (A) Distribution of number of variants per LD group (at r² ≥ 0.95), (B) distribution of genomic distance spanned by LD groups found on the chromosome or the megaplasmids (including only groups found only on one replicon), and (C) number of groups containing only PAVs, only SNPs, or both as well as the number of LD groups found within and across replicons. There were 22,057 SNPs and 10,674 PAVs that were not grouped with other variants and 9,501 LD groups with a median of three variants per group, and the largest group contained 6,970 variants. Half of all variants are in groups that contain ≤12 variants. Only variants used for association testing (minor allele frequency ≥ 5%, missingness ≤ 20%) were grouped.
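The filtering thresholds named in the caption (minor allele frequency ≥ 5%, missingness ≤ 20%) can be illustrated with a minimal sketch. The function name `passes_filters`, the 0/1 coding of calls (with `None` for a missing call), and the choice to compute allele frequency over non-missing strains only are assumptions made for illustration; the source does not describe how the filter was implemented.

```python
from typing import Optional, Sequence


def passes_filters(calls: Sequence[Optional[int]],
                   min_maf: float = 0.05,
                   max_missing: float = 0.20) -> bool:
    """Return True if a biallelic variant passes both filters.

    calls: one entry per strain; 0 or 1 for the two alleles
    (or absence/presence for a PAV), None where the call is missing.
    Assumption: allele frequency is computed over non-missing calls only.
    """
    n = len(calls)
    observed = [c for c in calls if c is not None]
    if n == 0 or not observed:
        return False
    missingness = (n - len(observed)) / n
    freq = sum(observed) / len(observed)
    maf = min(freq, 1 - freq)  # frequency of the rarer allele
    return maf >= min_maf and missingness <= max_missing
```

For example, a variant carried by 2 of 10 strains with no missing calls passes, while one carried by 1 of 100 strains does not.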
Mean r², a measure of nonindependence between segregating variants, is generally low between pairs of variants of different types or on different replicons, while the median size and spanned distance of LD groups is less on the megaplasmids than on the chromosome.
| Variant type or location | Mean r² | No. of ungrouped variants | No. of LD groups | Median no. of variants per LD group | Median LD group spanned distance (bp) |
|---|---|---|---|---|---|
| All | 0.06 | 32,821 | 9,501 | 3 | N/A |
| SNPs only | 0.07 | 22,057 | 8,364 | 3 | N/A |
| PAVs only | 0.02 | 10,764 | 632 | 2 | N/A |
| Between SNPs and PAVs | 0.03 | N/A | 505 | 7 | N/A |
| Chromosome SNPs | 0.24 | 789 | 900 | 7 | 173,406 |
| pSymB SNPs | 0.05 | 13,671 | 4,478 | 3 | 518 |
| pSymA SNPs | 0.12 | 7,597 | 2,912 | 3 | 1,063 |
Spanned distance calculated only for LD groups with SNPs that were all on the same replicon.
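To make the r² statistic and the r² ≥ 0.95 grouping threshold concrete, here is a hedged sketch. `r_squared` uses the standard haploid definition r² = D² / (pA(1 − pA) pB(1 − pB)). The grouping step in `ld_groups` is an assumption: it merges any two variants whose pairwise r² meets the threshold (single linkage via union-find), which may differ from the procedure the authors used, since the source does not describe the grouping algorithm. Variant names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """r^2 between two biallelic variants coded 0/1 across the same strains."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:  # a monomorphic variant carries no LD information
        return 0.0
    d = pab - pa * pb  # coefficient of linkage disequilibrium
    return d * d / denom


def ld_groups(variants: Dict[str, Sequence[int]],
              threshold: float = 0.95) -> List[List[str]]:
    """Single-linkage grouping (an assumed procedure) at the given r^2 threshold."""
    parent = {name: name for name in variants}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for u, v in combinations(variants, 2):
        if r_squared(variants[u], variants[v]) >= threshold:
            parent[find(u)] = find(v)

    groups: Dict[str, List[str]] = {}
    for name in variants:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values())
```

Because r² ignores the direction of the correlation, a PAV that is perfectly anticorrelated with a SNP lands in the same group as that SNP. The all-pairs loop is quadratic in the number of variants, so it is only meant to show the idea on small inputs.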
FIG 2 (A) Phenotypic distributions of the focal traits and (B) proportion of phenotypic variance explained (PVE) by relatedness among strains (i.e., the K-matrix) alone, as predicted by a linear mixed model, and by both relatedness and large-effect variants through a Bayesian sparse linear mixed model (BSLMM) implemented in GEMMA. PVE was calculated for all variants, only SNPs, and only PAVs. The gray lines indicate the lower 95% of the empirical null distributions from permuted data sets.
FIG 3 Evaluation of the expected proportion of variance explained (PVE) for A17 biomass by the most strongly associated variants as determined by association testing and forward model selection. Panel A shows the cumulative PVE of the 10 most strongly associated variants (black line; more than 10 variants rarely explained more variation than expected by chance) as well as the cumulative PVE from each of 100 randomly permuted data sets that make up the empirical null distribution (gray lines). For A17 biomass, the actual data explain more variance than the permuted data; however, panel B shows that only the first 3 variants explain more of the residual PVE (i.e., after accounting for PVE of the previous variants) than expected by chance. In panel B, the vertical gray lines represent the lower 95% of the null distribution.
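The comparison against an empirical null built from permuted data can be sketched as follows. This is a deliberately simplified illustration, not the authors' pipeline: it scores each variant by the squared Pearson correlation with the phenotype (the variance explained in a single-variant linear regression), omits the relatedness correction and forward model selection mentioned in the caption, and builds the null by shuffling phenotypes among strains. All function names are hypothetical.

```python
import random
from typing import List, Sequence


def pve_single(genotype: Sequence[float], phenotype: Sequence[float]) -> float:
    """Squared Pearson correlation between one variant and the phenotype."""
    n = len(phenotype)
    mg = sum(genotype) / n
    mp = sum(phenotype) / n
    sgg = sum((g - mg) ** 2 for g in genotype)
    spp = sum((p - mp) ** 2 for p in phenotype)
    if sgg == 0 or spp == 0:
        return 0.0
    sgp = sum((g - mg) * (p - mp) for g, p in zip(genotype, phenotype))
    return sgp * sgp / (sgg * spp)


def best_pve(genotypes: Sequence[Sequence[float]],
             phenotype: Sequence[float]) -> float:
    """PVE of the most strongly associated variant."""
    return max(pve_single(g, phenotype) for g in genotypes)


def permutation_null(genotypes: Sequence[Sequence[float]],
                     phenotype: Sequence[float],
                     n_perm: int = 100, seed: int = 0) -> List[float]:
    """Best-variant PVE in each of n_perm phenotype-shuffled data sets."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = list(phenotype)
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        null.append(best_pve(genotypes, shuffled))
    return null


def exceeds_null(observed: float, null: Sequence[float],
                 quantile: float = 0.95) -> bool:
    """True if observed is greater than at least `quantile` of the null values."""
    return sum(v < observed for v in null) / len(null) >= quantile
```

Taking the maximum over all variants within each permutation keeps the null comparable to the observed top hit, which is itself a maximum over many tests.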
FIG 4 The proportion of remaining phenotypic variance of the focal traits explained by adding each additional top variant, as in Fig. 3B.
Candidate genes tagged by variants in LD groups that explained more variation than expected based on the empirical null distribution (see Fig. S7 for QQ-plots)
| Trait | Replicon | Position | Annotation (MaGe) |
|---|---|---|---|
| 2-Aminoethanol | pSymA | 52580 | |
| Formic acid | pSymA | 38256 | |
| Gentamicin | pSymA | 796714 | Transcriptional regulator, ROK family (SMEL_v1_mpb0963) |
| | pSymA | 263510 | |
| | pSymA | 282760 | Putative aldehyde dehydrogenase (SMEL_v1_mpb0345) |
| Spectinomycin | pSymA | PAV | Conserved protein of unknown function (SMEL_v1_mpb0259) |
| Streptomycin | Chrom. | PAV | Multisensor signal transduction histidine kinase (SMEL_v1_0575) |
| Desiccation | pSymB | 1161576 | Putative aldehyde or xanthine dehydrogenase (SMEL_v1_mpa1160) |
| A17 biomass | pSymA | 269841 | |
| | | 269869 | |
| | | 270090 | |
| | | 270096 | |
| | | 270157 | |
| | | 270283 | |
| | | 270292 | |
| | pSymA | 271348 | |
| | pSymA | 274195 | Unannotated |
| | pSymA | 276359 | |
| | | 276443 | |
| | | 276563 | |
| | pSymB | 1231268 | |
| | pSymB | 1376015 | Diguanylate cyclase/phosphodiesterase (SMEL_v1_mpa1374) |
| | pSymB | 669804 | Sulfotransferase family (SMEL_v1_mpa0678) |
| R108 biomass | pSymA | 305290 | |
| | | 305308 | |
| | | 305353 | |
| AMT | pSymA | 648346 | Diguanylate cyclase/phosphodiesterase (SMEL_v1_mpb0802) |
| | | 649133 | |
| AP | pSymA | PAV | |
MaGe: http://www.genoscope.cns.fr/agc/microscope/home/index.php.
AMT, annual mean temperature.
AP, annual precipitation.
Variants are sorted by genomic position, not ranking or LD group.
For most traits, phenotypic variance explained by genome-wide relatedness (“PVE LMM”) was greater than the phenotypic variance explained by just the top variants.
| Trait | PVE top variants | PVE LMM |
|---|---|---|
| 2-Aminoethanol | 0.05 | 0.00 |
| Gentamicin resistance | 0.14 | 0.49 |
| Spectinomycin resistance | 0.43 | 0.50 |
| Streptomycin resistance | 0.34 | 0.58 |
| Annual mean temperature | 0.09 | 0.12 |
| Annual precipitation | 0.10 | 0.23 |
| Formic acid | 0.08 | 0.30 |
| Desiccation tolerance | 0.19 | 0.31 |
| A17 biomass | 0.33 | 0.74 |
| R108 biomass | 0.19 | 0.53 |
| R108 nodule number | 0.06 | 0.19 |
PVE top variants: maximum cumulative PVE among 1 to 25 variants chosen by model selection after subtracting the median of the empirical null distribution obtained from random permutations.
PVE LMM: after subtracting the median of the null distribution.