Katie Saund, Zena Lapp, Stephanie N Thiede, Ali Pirani, Evan S Snitkin.
Abstract
While variant identification pipelines are becoming increasingly standardized, less attention has been paid to the pre-processing of variants prior to their use in bacterial genome-wide association studies (bGWAS). Three nuances of variant pre-processing that affect the downstream identification of genetic associations are the separation of variants at multiallelic sites, the separation of variants in overlapping genes, and the referencing of variants relative to ancestral alleles. Here we demonstrate the importance of these variant pre-processing steps on diverse bacterial genomic datasets and present prewas, an R package that standardizes the pre-processing of multiallelic sites, overlapping genes, and reference alleles before bGWAS. This package facilitates improved reproducibility and interpretability of bGWAS results. prewas enables users to extract maximal information from bGWAS by implementing multi-line representation for multiallelic sites and variants in overlapping genes. prewas outputs a binary SNP matrix that can be used for SNP-based bGWAS and that prevents the masking of minor alleles during bGWAS analysis. The optional binary gene matrix output can be used for gene-based bGWAS, which enables users to maximize the power and evolutionary interpretability of their bGWAS analyses. prewas is available for download from GitHub.
Keywords: GWAS; data pre-processing; multiallelic loci; overlapping genes; reference allele; software
Year: 2020 PMID: 32310745 PMCID: PMC7371116 DOI: 10.1099/mgen.0.000368
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. prewas workflow. (a) Overview of the prewas workflow. Grey and colored boxes: processing steps. White boxes: output generated. (b) Multi-line representation of multiallelic sites. (c) Possible methods to find a reference allele. The ancestral allele method and the major allele method are implemented in prewas. (d) Grouping SNPs into genes.
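The multi-line representation in panel (b) can be illustrated with a minimal Python sketch. It is not the prewas implementation (prewas is an R package); the function name `split_multiallelic`, the row-naming scheme, and the toy data are hypothetical, and missing genotype calls are not handled.

```python
def split_multiallelic(site_id, ref, sample_alleles):
    """Return one binary row per alternative allele observed at a site.

    sample_alleles maps sample name -> observed base at this site.
    In each row, a sample is coded 1 only if it carries that row's
    alternative allele, and 0 otherwise.
    """
    # Every distinct non-reference allele gets its own row.
    alt_alleles = sorted({a for a in sample_alleles.values() if a != ref})
    rows = {}
    for alt in alt_alleles:
        row_id = f"{site_id}_{ref}>{alt}"  # hypothetical naming scheme
        rows[row_id] = {s: int(a == alt) for s, a in sample_alleles.items()}
    return rows


# Toy triallelic site: reference A, alternative alleles T and G.
site = {"s1": "A", "s2": "T", "s3": "G", "s4": "A"}
rows = split_multiallelic("pos100", "A", site)
```

With a single-row encoding, the T and G carriers would be pooled or one allele would be dropped. Here each alternative allele can be tested for association separately.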
Bacterial datasets
| Name | Samples (count) | Multiallelic sites (count) | Mean SNP distance (bp) | SNPs in overlapping genes (count) | Reference |
|---|---|---|---|---|---|
| | 107 | 3527 | 18010.4 | 11 511 | [ |
| | 247 | 2460 | 6840.8 | 7862 | [ |
| | 152 | 118 | 2976.5 | 8 | [ |
| | 157 | 201 | 5960.1 | 20 | [ |
| | 453 | 920 | 3825.4 | 76 | [ |
| | 28 | 536 | 9501.5 | 34 | [ |
| | 150 | 296 | 5195 | 74 | [ |
| | 267 | 391 | 5561.4 | 38 | [ |
| | 149 | 3080 | 11243.4 | 32 594 | [ |
Fig. 2. Prevalence and predicted functional impact of multiallelic sites. (a) The number of multiallelic sites increases as sample size increases until the total diversity of the dataset is sampled. (b) More diverse samples have relatively more multiallelic sites. (c) Counts of predicted functional impact (mis)matches for pairs of alleles at triallelic sites (aggregated across all datasets). Alternative alleles often differ in impact.
Fig. 3. Methods to determine the reference allele identify different alleles. (a) The fraction of variant positions where the identified reference allele varies between two methods. Only high-confidence ancestral reconstruction sites (≥87.5 % confidence in the ancestral root allele by maximum likelihood) are included. (b) Fraction of low-confidence ancestral reconstruction sites for each dataset (<87.5 % confidence in the ancestral root allele by maximum likelihood).
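The comparison in panel (a) can be sketched in Python under simplifying assumptions. The ancestral allele at each site is taken as given input rather than reconstructed from a phylogeny, ties for the major allele are broken alphabetically, and the names `major_allele` and `reference_disagreement` are hypothetical rather than part of prewas.

```python
from collections import Counter


def major_allele(sample_alleles):
    """Return the most common allele at a site.

    Ties are broken alphabetically so the result is deterministic
    (an assumption made for this sketch).
    """
    counts = Counter(sample_alleles.values())
    return min(counts, key=lambda allele: (-counts[allele], allele))


def reference_disagreement(sites, ancestral):
    """Fraction of sites where the major allele differs from the
    supplied ancestral allele."""
    differing = sum(
        major_allele(alleles) != ancestral[pos] for pos, alleles in sites.items()
    )
    return differing / len(sites)


# Toy data: at p2 the ancestral allele C is now the minor allele.
sites = {
    "p1": {"s1": "A", "s2": "A", "s3": "T"},
    "p2": {"s1": "C", "s2": "G", "s3": "G"},
}
ancestral = {"p1": "A", "p2": "C"}
fraction = reference_disagreement(sites, ancestral)
```

In this toy data the two methods agree at p1 and disagree at p2, so the choice of method changes which samples are coded as carrying the variant at p2.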
Fig. 4. SNPs in overlapping sites can have distinct functional impacts in each gene of the gene pair. The fraction of overlapping variant positions where the SNP has a different predicted functional impact in each of the two overlapping genes.
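One way to picture how SNPs in overlapping genes feed a binary gene matrix is the Python sketch below. It assumes that a gene is coded 1 for a sample when the sample carries any variant assigned to that gene, and that every SNP row covers the same samples. The function name `snps_to_gene_matrix` and the toy annotations are hypothetical, not the prewas API.

```python
def snps_to_gene_matrix(snp_rows, snp_genes):
    """Collapse binary SNP rows into a binary gene matrix.

    snp_rows:  SNP id -> {sample: 0 or 1}
    snp_genes: SNP id -> list of genes containing the SNP
               (two genes for a SNP in an overlapping gene pair)
    """
    gene_matrix = {}
    for snp_id, calls in snp_rows.items():
        # A SNP in overlapping genes contributes to each gene separately.
        for gene in snp_genes[snp_id]:
            gene_row = gene_matrix.setdefault(gene, {s: 0 for s in calls})
            for sample, value in calls.items():
                # Gene is 1 if the sample carries any variant in the gene.
                gene_row[sample] = max(gene_row[sample], value)
    return gene_matrix


snp_rows = {
    "snp1": {"s1": 1, "s2": 0, "s3": 0},
    "snp2": {"s1": 0, "s2": 1, "s3": 0},
}
# snp1 lies in the overlap of geneA and geneB; snp2 lies only in geneB.
snp_genes = {"snp1": ["geneA", "geneB"], "snp2": ["geneB"]}
gene_matrix = snps_to_gene_matrix(snp_rows, snp_genes)
```

Because snp1 is assigned to both genes, it is counted in geneA and in geneB rather than being attributed to only one of them.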