Improved Prediction of Bacterial Genotype-Phenotype Associations Using Interpretable Pangenome-Spanning Regressions.

John A Lees¹, T Tien Mai², Marco Galardini³, Nicole E Wheeler⁴, Samuel T Horsfield⁵, Julian Parkhill⁶, Jukka Corander²,⁴,⁷.

Abstract

Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations create challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.

IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.

Keywords: elastic net; pangenome; phenotype prediction

Year: 2020 PMID: 32636251 PMCID: PMC7343994 DOI: 10.1128/mBio.01344-20

Source DB: PubMed Journal: mBio Impact factor: 7.867

INTRODUCTION

Bacterial genomics has recently entered an era of “big data.” Single cohorts with 10⁴ to 10⁵ samples, 10⁸ genetic variants, and corresponding extensive high-quality metadata are now publicly available (1, 2). In the context of bacterial populations, the challenge is to take large data sets consisting of whole genomes and metadata such as measured antimicrobial resistance or host disease status and identify genetic variants associated with antimicrobial resistance, host specificity, or virulence phenotypes. With enough independent observations, hypothesis-free machine learning methods can generate models which predict the phenotype of new isolates and potentially tell us something about the underlying genetic mechanisms. A deluge of recent papers has applied general predictive models to such data sets and has mostly shown high accuracy (3–8). However, some commentaries have been more cautious in their conclusions (9, 10). The overall problem of relating microbial genotype to phenotype has generally been approached by genome-wide association study (GWAS) methods (11–13). Such methods determine whether there is sufficient evidence to conclude that a specific variant explains some proportion of the variation of a trait, after accounting for as many statistical artifacts as possible. Generally, these methods start by taking the following approach: consider the association between a phenotype and a single variant, evaluate this association while accounting for known covariates, including population structure, and then “scan” this test along the whole genome one variant at a time. Ideally, associated sets of variants will be investigated further to find causally associated variants within these sets of linked loci (known as fine-mapping), but even in the best case, this is extremely challenging (14, 15). 
It is particularly crucial for microbial GWAS methods to incorporate a correction for population structure in each test, the specifics of which vary between methods and data sets. Further adjustment may be necessary if confounders are genetically stratified due to sampling strategy. Despite the required adjustments, the simplicity of this method is one of its great strengths: it is quick and easy to apply, understand, and visualize. Useful extensions which allow the estimation of heritability (16), the proportion of phenotypic variance explained by genotype, and prediction of phenotype (by forming linear predictors from significant variants [17]) are also relatively simple to implement. Large cohorts of bacterial sequences have also been a tempting target for machine learning and “deep learning” methods such as convolutional neural networks, which are able to relate arbitrary high-dimensional inputs to measured outputs with high accuracy and without the need for specialized model descriptions for each new problem (18). Rather than assessing evidence for association of individual variants as in GWAS, these methods instead aim to find a set of variants and a mapping to predict a trait as accurately as possible. They are potentially broadly applicable to any problem with vast amounts of data, though they perform best when the number of data points exceeds the number of dimensions. Unsurprisingly their uptake in sequence analysis (19, 20) and bacterial genomics specifically has been rapid (4, 5, 21). In general, the predictive variants in the identified sets may not fully overlap those significant by themselves, and the mapping does not necessarily lend itself readily to the same interpretation as P values in a GWAS. However, some issues crucial to understanding bacterial populations remain unaddressed. First, bacterial populations tend to exhibit a strong population structure, meaning samples cannot be treated as independent. 
In the context of prediction, this can result in the selection of features unrelated to the phenotype but common to the background of associated strains (lineage effects). While not necessarily a problem in the training data set, if new data are drawn from different strains, this can lead to much poorer prediction than expected. Examples of this effect are ubiquitous in artificial intelligence, for example, automatically designed clocks which work only in the lab in which they were built (22, 23). In human genetics, a similar problem has arisen due to an overrepresentation of samples of European ancestry in genotype databases. This has caused polygenic risk scores, which were originally thought to be highly accurate predictors of disease liability, to have significantly lower accuracy in samples of non-European ancestry, which make up most of the global population (24). Additionally, these methods are unable to deal with missing input data. Variant calling, either by read pileup to generate single nucleotide polymorphisms (SNPs) or from a graph, when conducted separately for each population, is likely to produce disparate input sets with very different minor allele frequencies. Without using a method which produces consistent variant calls in test data sets, the accuracy of predictive models is likely to be heavily overstated. As well as methodological approaches, more representative sampling of the pathogen population which does not oversample clonal lineages may be advisable. Here, we set out to develop a method which combines the desirable attributes of both of these classes of approaches when analyzing the genetics underlying bacterial traits. We wished to retain the simplicity and interpretability of traditional GWAS approaches and combine this with the flexibility and accuracy of machine learning methods which can be fitted to the entire data set at once. These models were previously applied to human GWAS data sets for inference (25–27) and prediction (28). 
This pangenome-wide approach reflects the polygenic nature of complex traits better than older fixed effect methods which must select only some population structure covariates to include (29). Unlike marginal tests (the standard single predictor test in GWAS), a genome-wide regression approach gives rise to an increase of resolution when sample size increases, as was previously noted in human GWAS (30–32). Additionally, simultaneously analyzing predictors together in a regression model means that interactions and correlation between the predictors (e.g., population structure) may be included implicitly (33). Using large genomic data sets from four different species and sixteen varied phenotypes, we find that an elastic net model (33) selects similar variants to a GWAS and does not sacrifice its major advantage of quantitative model interpretability. Using simulated data, we demonstrate improved power, but an increase in false-positive rate, compared to that of linear mixed models. We illustrate this use in practice on antibiotic resistance phenotypes in two species and show further results which find similar accuracy between new machine learning and simpler approaches, consistent with previous studies (4, 5, 34). Our approach models the entire pangenome of the population to include the large proportion of variation which resides in the accessory genome. It explicitly addresses issues of population structure and consistent performance between trained and new (test) data sets. The method is broadly usable, not requiring programming knowledge or manual adjustments for new data sets, and allows for the sharing of models between researchers. We have implemented the elastic net model in the pyseer microbial package as a new “prediction” module and consistent pangenome variant calling in two further packages. An extensive tutorial for all of these methods is available online (https://pyseer.readthedocs.io/en/master/predict.html).

RESULTS

Method overview.

The elastic net uses a linear prediction model, as in a standard linear regression run between a phenotype and all genetic variants, and tries to find slopes (effect sizes) for all the variants which best predict the phenotype. A shrinkage term (λ) is used to prevent overfitting, adding a cost to each fitted slope proportional to its value. This has the effect of making many of the genetic variants have a slope of zero, and so they can be removed from the model entirely. To predict the phenotype in new samples, these fitted slopes form a simple linear model that can be applied to a new set of input variants. Furthermore, this also allows the calculation of R², the proportion of variance in phenotype due to the variance of all genetic effects, that is, the heritability (total effect of genetics) of the phenotype.

We use an alignment-free representation of variation as input. Alignment-free approaches have proven particularly popular in bacterial populations, removing the need for selection of a particular reference and simultaneously modeling both gene content and sequence variation (11, 12, 35). Many methods have previously used k-mers, which are short sequence words of length k. The DBGWAS method proposes connecting the overlaps of these k-mers in a compressed de Bruijn graph (DBG) so that k-mers are extended using adjacent sequence information in the population, forming unitigs present in exactly the same set of samples as their constituent k-mers (36). We followed the same approach here. As the elastic net includes predictors that are correlated due to population structure in the same model fitting, it will typically downweight these, to some extent incorporating this evolutionary history into its model. However, as we note below, effect size alone does not have the same interpretation as a P value from GWAS and must be considered along with minor allele frequency (MAF) and other confounders. 
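For reference, the penalized fit described at the start of this section can be written as a single objective. The form below is a sketch that assumes the glmnet-style parametrization, with a mixing parameter α (α = 1 gives the lasso) and optional per-sample weights; the symbols N (samples), M (variants), β₀ (intercept), and wᵢ (sample weight) are notation introduced here for illustration and are not taken from the text.

```latex
\hat{\beta}_0, \hat{\beta} = \arg\min_{\beta_0, \beta}
  \frac{1}{2N} \sum_{i=1}^{N} w_i \left( y_i - \beta_0 - \sum_{j=1}^{M} x_{ij}\beta_j \right)^2
  + \lambda \left[ \frac{1-\alpha}{2} \sum_{j=1}^{M} \beta_j^2
  + \alpha \sum_{j=1}^{M} \lvert \beta_j \rvert \right]
```

Under this form, the absolute-value term is what sets many slopes to exactly zero, while the squared term allows correlated variants to share weight rather than forcing a single one to be chosen.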
To include population structure explicitly in the model, we first divide the population into strains (or lineages). Each sample’s contribution to the elastic net fit is then downweighted by the prevalence of its strain so that repeated observations of the same genotype count for less, an approach known as “sequence reweighting.” We also use these strains to select the value of the shrinkage term λ: we designed a “leave-one-strain-out” (LOSO) cross-validation rather than randomly leaving samples out. This aims to avoid correctly predicting the strain (which is frequently correlated with phenotype) rather than the phenotype itself, since such an approach is less robust when common strains dominate the data or when the fitted population model is not representative of a population in which predictions will be made. When applied together, we refer to this as a “weighted” model, as opposed to “unweighted” models which use neither of these adjustments.
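The two adjustments can be sketched in a few lines. This is a minimal illustration, assuming that each sample's weight is the inverse of its strain's size and that each LOSO fold holds out all samples of one strain; the function names and strain labels are hypothetical, and the exact weighting used in pyseer may differ.

```python
from collections import Counter


def sequence_weights(strains):
    """Weight each sample by 1 / (number of samples in its strain)."""
    counts = Counter(strains)
    return [1.0 / counts[s] for s in strains]


def loso_folds(strains):
    """Yield (held_out_strain, train_idx, test_idx), one fold per strain."""
    for held_out in sorted(set(strains)):
        test_idx = [i for i, s in enumerate(strains) if s == held_out]
        train_idx = [i for i, s in enumerate(strains) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical strain assignments for six samples.
strains = ["A", "A", "A", "B", "B", "C"]
weights = sequence_weights(strains)
folds = list(loso_folds(strains))
```

With inverse-size weights, every strain contributes the same total weight to the fit regardless of how heavily it was sampled, and each candidate λ is scored only on strains that were absent from training.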

Prediction within and between cohorts without sacrificing model interpretability.

Whole-genome models can be used to construct a linear model to predict phenotypes in new data. In this section, we evaluate these predictions compared to those of other models and variant calling methods using a variety of data sets and phenotypes.

(i) We first evaluated the predictive performance of our models, with and without population structure control, compared to that of a more complex deep learning model. We used an M. tuberculosis data set with antibiotic resistance to four first-line antibiotics (rifampin, isoniazid, ethambutol, and pyrazinamide). As M. tuberculosis has no accessory genome and minimal core gene variation (37), comparison with more complex models and a SNP alignment is possible. Previous work has evaluated the use of a multitask deep neural network and, when comparing this to lasso regression, found comparable accuracy (5). Using the same input of ∼6,500 SNPs and short insertion/deletions across the allele-frequency spectrum for 3,566 samples (split into training and test data sets) led to average false-negative rates of 2% ± 3% in the unweighted model and 3% ± 4% in the weighted model and false-positive rates of 11% ± 8% in the unweighted model and 12% ± 10% in the weighted model. The elastic net therefore gives similar performance to the lasso as well as the more complex neural network (see Table S1 in the supplemental material), as was also shown by the original study authors (5). It is, however, much easier and faster to run on standard hardware (run time of <1 s on a central processing unit [CPU] versus ∼3 min on a high-end graphics processing unit [GPU]) and gives results which are far more readily interpretable.

TABLE S1

Prediction accuracy on the Mycobacterium tuberculosis dataset. For resistance to each of the four front-line treatments, we fitted a model with and without sequence reweighting and compared accuracy on a uniform random test/training split. The results from the neural network in Chen et al. (5) (wide and deep neural network [WDNN]) are also included. TP, true positives; TN, true negatives; FP, false positives; FN, false negatives. The number of samples in each lineage is as follows: L1, 452; L2, 448; L3, 207; L4, 73.

In this case, the weighted model generally performs slightly worse than an unweighted model. The population structure of this sample is relatively simple, with four distinct lineages. This is likely well captured implicitly by the unweighted model, and so the categorical weighting is of lower resolution. Sequence reweighting is instead expected to be more effective in data sets with more complex structures or when adding in samples which are genetically distant from the training set (38, 39), which we explore further in the next section. Applying these weights allowed us to easily see that the majority of errors occurred in lineage I, which has deep branches forming genetically separated subclades, with generally perfect prediction in the other three lineages.

(ii) Prediction of pneumococcal resistance using different variant types. We also investigated the advantages of the use of unitigs over other variant calling methods. Using the same Streptococcus pneumoniae antimicrobial resistance in children (SPARC) data set described above for β-lactam and erythromycin resistance, we compared computational resources and prediction accuracy using SNPs, k-mers, and unitigs (Table 1).

TABLE 1

Predicting antibiotic resistance in the SPARC collection using different variant types

Variant type	Phenotype	No. selected	FPR (%)	FNR (%)	CPU time (min)	Memory usage (Gb)
SNPs (90,000), 3.6 Mb on disk	β-Lactam	4,374	3	7	4.4	1.3
SNPs (90,000), 3.6 Mb on disk	Erythromycin	2,341	3	63	4.1	1.3
Unitigs (730,000), 25 Mb on disk	β-Lactam	8,247	5	7	49.7	18
Unitigs (730,000), 25 Mb on disk	Erythromycin	1,591	9	39	52.6	6.9
k-mers (10 million), 603 Mb on disk	β-Lactam	15,121	6	7	420	212

Using a training/test split of 2:1, prediction accuracy of two phenotypes was tested using 90,000 SNP calls from mapping to a reference genome, and with 730,000 unitigs. We also tested prediction using 10 million variable-length k-mers to illustrate the heavy computational resource use in even a relatively small data set. File sizes are for the sparse data structures we employ.

We found that for β-lactam resistance, all three variant types gave similar predictive accuracy, with the elastic net able to select a small proportion of the total input variants in each case and apparently fairly insensitive to the far greater noise present in the higher dimensional variant types. As this resistance is due to allelic variation in core genes, we expect all three types to tag the causal variation equally well. For erythromycin, where causal variants are not all found in core genes, we observed a reduction in the false-negative rate when using unitigs. Computational usage increased roughly as NM (N, number of samples; M, number of variants). For common variants, M reaches an asymptote for a given population: the main requirement is therefore based on N. For all methods, the CPU time was modest, but memory usage may pose a problem. SNPs are tractable on a laptop, but unitig analysis likely requires a computing cluster for the model fitting (using a fitted model on test data requires negligible resources). k-mers require an enormous amount of memory, which would not scale to larger data sets. Though the unitig analysis was easy to schedule on our cluster, future improvements to reduce memory use could include accessing the variants as they are needed from a disk or fitting the elastic net in chunks, with resampling (40, 41).

(iii) Reduced intercohort accuracy is ameliorated with consistent genetic calls and population structure control. 
Random splits of single data sets into test and training data, while convenient for analysis, may mask inter-data set differences such as class imbalance (different resistance rates), unobserved lineages, and technical errors (variant calling) (10). To test a more realistic example, where a previously fitted model is used to predict resistance status in new unobserved data, we set up a prediction experiment using genomic data from three large, very different pneumococcal cohorts with β-lactam resistance: SPARC (603 U.S. children covering introduction of vaccine), Maela (3,162 unvaccinated infants and mothers), and global pneumococcal sequence (GPS) (5,820 globally distributed samples, mostly vaccinated). We counted unitigs for each population and used these to train a predictive model. These models were evaluated on the data they were trained on and on the other two cohorts by using consistently named unitigs from unitig-caller (Table 2). The resources used were as follows: SPARC, 5 Gb random access memory (RAM), 0.6 h; Maela, 30 Gb RAM, 2.5 h; GPS, 3.1 h. The majority (∼80%) of the CPU time used was for reading variants from text files, making subsequent fitting faster. The distributions of unitig sizes are shown in Fig. S1.

TABLE 2

Comparison of intra- and intercohort prediction accuracy

Model	No. of selected unitigs (% in pbp genes)	Accuracy^b
		SPARC data			Maela data			GPS data
		FNR	FPR	R²	FNR	FPR	R²	FNR	FPR	R²
Sequence reweighting
SPARC	5,251 (10)	0.063	0.024	0.837	0.007	0.239	0.439	0.149	0.134	0.505
Maela	6,645 (14)	0.446	0.005	0.276	0.082	0.042	0.760	0.029	0.382	0.425
GPS	894 (4)	0.011	0.411	0.447	0.144	0.177	0.458	0.094	0.200	0.545
Without weighting
SPARC	7,261 (10)	0.040	0.013	0.901	0.012	0.163	0.487	0.165	0.130	0.487
Maela	8,705 (9)	0.397	0.011	0.339	0.063	0.036	0.805	0.049	0.322	0.449
GPS	7,511 (2)	0.050	0.152	0.656	0.319	0.026	0.452	0.129	0.037	0.864

For each prediction, the error rates are listed along with overall R². For SPARC and Maela, phenotype was binary (resistant/sensitive); for GPS, phenotype was continuous (MIC). Where conversion was needed, we applied the standard breakpoint of MIC > 0.12 mg/liter for resistance.

Shaded cells are within-cohort. FNR, false-negative rate; FPR, false-positive rate.
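The breakpoint conversion and the two error rates reported in Table 2 can be sketched as follows. This is an illustrative example only: the function names and the MIC values are hypothetical, and it assumes the usual definitions FNR = FN/(FN + TP) and FPR = FP/(FP + TN), with resistance as the positive class.

```python
def mic_to_resistant(mic_mg_per_l, breakpoint=0.12):
    """Binary resistance call: resistant if MIC exceeds the breakpoint."""
    return mic_mg_per_l > breakpoint


def error_rates(observed, predicted):
    """Return (FNR, FPR) for paired boolean observed/predicted calls."""
    tp = fp = tn = fn = 0
    for obs, pred in zip(observed, predicted):
        if obs and pred:
            tp += 1
        elif obs and not pred:
            fn += 1
        elif not obs and pred:
            fp += 1
        else:
            tn += 1
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return fnr, fpr


# Hypothetical observed and predicted MICs (mg/liter) for six isolates.
observed_mic = [0.03, 0.06, 0.12, 0.25, 2.0, 4.0]
predicted_mic = [0.02, 0.20, 0.10, 0.30, 0.10, 8.0]
observed = [mic_to_resistant(m) for m in observed_mic]
predicted = [mic_to_resistant(m) for m in predicted_mic]
fnr, fpr = error_rates(observed, predicted)
```

Note that an isolate with MIC exactly at 0.12 mg/liter is called sensitive here, because the breakpoint is a strict inequality.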

FIG S1

Histograms of the unitig length distributions in S. pneumoniae data sets. Minimum unitig size is by definition the de Bruijn graph k-mer size, which was chosen to be 31 for all datasets.

Between-cohort predictive accuracy was considerably lower than within-cohort accuracy but still outperformed an intercept-only model in all cases. The use of unitigs proved successful: repeating the SPARC-Maela comparisons with SNPs led to extremely poor predictions for every sample, as the selected SNPs were called as missing in the other cohort, leading to a mean value prediction for every sample (true negatives, 1,661; false positives, 0; false negatives, 1,282; true positives, 0). Fixing this issue with SNPs would likely require a labor-intensive mapping and joint recalling of variation, whereas with sequence elements, the simple search implemented in unitig-caller can be used. To deal with missing unitig data, which may truly be missing or miscalled, we assumed it was all truly missing. On the same SPARC-Maela comparison (false-negative rate [FNR], 0.007; false-positive rate [FPR], 0.239; R², 0.439), this gave very similar results to those using mean allele frequency (AF) imputation for missing unitigs (FNR, 0.008; FPR, 0.232; R², 0.453). Depending on the specific model and data set combination, errors can much more commonly be type I or type II, possibly reflecting class imbalance, despite overall resistance rates in the pneumococcus being stable (42). 
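The two ways of handling missing unitigs compared above can be sketched with a plain linear predictor. This is a minimal illustration, not the pyseer implementation: the function name, the slopes, the allele frequencies, and the short toy sequences standing in for unitigs are all hypothetical.

```python
def predict(intercept, slopes, calls, allele_freq=None):
    """Linear prediction from fitted slopes.

    slopes: unitig -> fitted slope.
    calls: unitig -> 0/1 presence in the new sample; missing unitigs are omitted.
    allele_freq: optional unitig -> training-set frequency. If given, missing
    unitigs are imputed with it; otherwise they are treated as absent (0).
    """
    total = intercept
    for unitig, slope in slopes.items():
        if unitig in calls:
            value = calls[unitig]
        elif allele_freq is not None:
            value = allele_freq[unitig]
        else:
            value = 0.0
        total += slope * value
    return total


# Hypothetical fitted model with three selected unitigs.
slopes = {"ACGT": 0.8, "TTGA": -0.3, "GGCA": 0.5}
train_af = {"ACGT": 0.4, "TTGA": 0.1, "GGCA": 0.2}
calls = {"ACGT": 1, "TTGA": 0}  # "GGCA" was not called in this sample

as_absent = predict(0.1, slopes, calls)
imputed = predict(0.1, slopes, calls, allele_freq=train_af)
```

The two predictions differ only by the slope of each missing unitig multiplied by its training frequency, which is small when the missing unitigs are rare or have small fitted slopes.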
The GPS cohort gave the worst performing model, despite it being the largest collection. This is a very genetically diverse sample, which introduces more potential for confounding lineage effects to enter the model. Furthermore, this cohort is a mix of sequences isolated from cases of asymptomatic carriage and disease, whereas SPARC and Maela contain only asymptomatic carriage cases. The GPS cohort is enriched with more-virulent strains, which have more frequently faced treatment with antibiotics and have a higher rate of resistance (1, 43). We also note that the area under the curve (AUC) of the receiver operating characteristic (ROC) is misleadingly high (0.9185/0.9728 for the weighted/unweighted GPS model) and would encourage the reporting of error rates as more intuitive summaries of accuracy for bacterial traits such as resistance. We found that sequence reweighting generally reduced prediction accuracy for this phenotype, although it is the LOSO strategy in particular which gave slightly more representative accuracy estimates for out-of-cohort prediction (when comparing within data set with between data set prediction, R² was 72% higher with sequence reweighting versus 82% higher without sequence reweighting), and more of the selected variants were in the causal loci.

(iv) Virulence phenotypes can be predicted with sequence reweighting, preventing overestimation of accuracy. Most work on prediction of bacterial phenotypes has focused on antibiotic resistance, but many more complex phenotypes relating to bacterial virulence are now available. For these phenotypes, which are under weaker or no selection, instead of a few strong effects, multiple smaller effects are expected in the genome (44–46). Therefore, a model which may include more of these effects, which would be missed with a P value threshold, may be expected to perform well. 
We applied our method to predict the duration of asymptomatic carriage in a subset of the Maela cohort, which can easily be visualized in the manner of a linear regression (Fig. 1). We show the observed versus predicted values for the training and test sets, both with and without sequence reweighting. In the unweighted training set, R² (and heritability [h²]) was 0.89 (Fig. 1, top left), but the test R² was only 0.27 (Fig. 1, bottom left), showing clear overfitting. With sequence reweighting and LOSO, the training and test estimates were much closer (0.37 and 0.28, respectively) (Fig. 1, right). In this case, sequence reweighting gave a more realistic heritability estimate. h² was previously estimated to be 0.634 using phylogenetic pairs, and 0.445 using restricted maximum likelihood (REML); these may be overestimates, especially as the revised estimates introduced here used more information from the genome.
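The gap between training and test R² can be checked with a short calculation. The sketch below assumes R² is computed as one minus the ratio of residual to total sum of squares; the text does not state which definition was used, and the observed and predicted values here are made up for illustration.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical phenotype values: a close fit and a looser one.
observed = [1.0, 2.0, 3.0, 4.0]
close_fit = [1.1, 1.9, 3.2, 3.8]
loose_fit = [2.0, 2.5, 2.5, 3.0]

r2_close = r_squared(observed, close_fit)
r2_loose = r_squared(observed, loose_fit)
```

A model that reproduces its training data closely but yields a much lower value on held-out samples shows the pattern described for the unweighted model.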

FIG 1

Prediction of carriage duration and the effect of sequence reweighting on heritability estimation. For the same training/test split, each panel shows observed log(carriage duration) values on the x axis and model-predicted values on the y axis, with a fitted linear regression. (Left) Unweighted model on the training data (top) and test (bottom). (Right) The same for the model with sequence reweighting.

We also tested virulence prediction in two streptococcal species, which would be a useful application for routine pathogen surveillance but has not been as thoroughly explored as resistance prediction. Using Streptococcus pneumoniae isolated from Dutch adults, we fitted a model which selected 9,701 unitigs. This model was able to predict meningitis versus carriage genomes (test FPR, 0.059; FNR, 0.12) and gave a similar h² estimate to that originally reported (0.65 versus 0.70). Comparing tissue infection with carriage of Streptococcus pyogenes gave a model with 5,817 unitigs, which had a higher error rate than the S. pneumoniae model (test FPR, 0.24; FNR, 0.25), but the phenotype also had a correspondingly lower h² estimate of 0.343. Detailed performance of these models is given in Table S2.

TABLE S2

Prediction accuracy of the elastic net in other binary datasets, all of which are S. pneumoniae.

Power and false-discovery rate compared to GWAS using simulated phenotypes.

To test the characteristics of the elastic net compared to GWAS approaches, we simulated phenotype data from the Maela population (3,162 S. pneumoniae genomes) (Table 3) using previously defined SNP variation (47). We chose either 5, 25, 100, or 300 true causal variants with an effect size of 4 (similar to a penetrant antimicrobial resistance variant), either (i) chosen uniformly at random across the genome, after linkage disequilibrium (LD) pruning (no variants with R² > 0.9), or (ii) chosen uniformly at random from 1 to 3 prespecified genes (pbpX-pbp2x, penA-pbp2b, and pbp1a).

TABLE 3

Summary of data sets tested

Data set name	Species	Phenotype(s) and split	Reference	No. of samples	No. of samples for training/test	No. of genetic features
TB	Mycobacterium tuberculosis	First-line antibiotic resistance: rifampicin, 1,285:2,257; isoniazid, 1,553:2,011; pyrazinamide, 702:2,445; ethambutol, 975:2,551	5	3,566	2,377/1,189	6,400 (SNPs)
N. gonorrhoeae	Neisseria gonorrhoeae	Antibiotic resistance MICs: azithromycin, cefixime, ciprofloxacin, penicillin, and tetracycline	53, 61, 83, 84	1,595	NU	550,000 (unitigs)
GAS	Streptococcus pyogenes	Virulence, 1,093:637	46	1,730	1,154/576	1.1 million (unitigs)
SPARC	Streptococcus pneumoniae	Antibiotic resistance MICs: penicillin, erythromycin	47, 85	603	400/203	90,000 (SNPs), 730,000 (unitigs), 10 million (k-mers)
Maela	Streptococcus pneumoniae	Carriage duration; antibiotic resistance: penicillin, 1,661:1,282; erythromycin, 802:2,355; trimethoprim, 609:2,548	12, 44	3,162 (antibiotic resistance), 2,017 (carriage duration)	1,404/703 (carriage duration)	121,000 (SNPs), 1.6 million (unitigs)
GPS	Streptococcus pneumoniae	Antibiotic resistance (penicillin)	1	5,820	NU	1.7 million (unitigs)
Netherlands	Streptococcus pneumoniae	Meningitis/carriage, 693:1,144	45	1,837	1,225/612	690,000 (unitigs)

Each data set has a name by which it is referred to in the text. Most data sets have multiple phenotypes available, especially where multiple different antibiotic resistances are routinely phenotyped. Data sets without a training/test split were not evaluated for internal prediction ability as they were instead used with more stringent external validation data sets or were used for GWAS only, and all available samples were used to fit the model.

NU, not used.

We chose the first setting to emulate a polygenic trait, with many variants of roughly equal effects associated across the genome. LD pruning was only used to select causal variants, and all variants were used as input to the model. The second setting more closely resembles antibiotic resistance, where multiple alleles in either one or a small number of genes contribute to the effect, with multiple occurrences independent of genetic background. We ran the elastic net (α = 0.01) and lasso regression (α = 1) as well as both GWAS models (fixed effects and linear mixed model) previously implemented in pyseer. Variants were output by the genome-wide model if they had a nonzero coefficient and by the GWAS models if their P value was below a significance threshold of 0.05 after Bonferroni correction. For each simulated data set and method, counting true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), we calculated the power, the proportion of true causal variants in the output: TP/(TP + FN). This allowed us to analyze the overlap of selected variables which gave good prediction with those found in a GWAS, which are individually associated with a phenotype of interest and may therefore provide mechanistic insight into the trait being analyzed. 
We also calculated the proportion of false positives in the output, the number of variants selected in the output which are not true causal variants divided by the total number of variants being tested: FP/(TP + FP + FN + TN). This is a measure of the “noise” in the selected predictor set. First, our simulations showed that the correlation filtering step (reducing input size by 25%) reduced power by 4% on average and by 8% in the worst case (see Table S3), which occurred where many small-effect variants are spread across the genome; there was no appreciable power loss with smaller or more concentrated sets of causal variants. The sample correlation values are positively skewed due to population structure, and so filtering all variants with a sample correlation below the mean value rather than a quantile leads to an unacceptably high loss of power, as many causal variants would be removed. This quantile filter can therefore be used on large data sets to reduce CPU and memory usage with little effect on the variants selected, but if possible, the full set of variants should generally be modeled in its entirety.

True causal variants retained after filtering on sample correlation. In parentheses is the total number of retained variants. (a) Varied sample sizes, uncorrelated true variants are chosen from across the genome. (b) Varied sample sizes, variants chosen from pbpX. (c) Varied sample sizes, variants from both regions. (d) Varied heritability, variants from both regions. Download Table S3, PDF file, 0.1 MB.

Over all of our simulations, we found that either the elastic net or a fixed-effect GWAS had the highest power depending on the setting, and both always had higher power than the linear mixed model (Fig. 2 and Fig. S2). This is consistent with expectations from prior literature in multicellular organisms (25, 26, 40). The elastic net performed better in situations where the heritability was low or causal variants were spread out across the genome.
This is expected for less penetrant traits such as carriage duration (44) and transmissibility (48). There was slightly lower power for all methods with binary phenotypes, and this decrease was more pronounced in the linear mixed model, possibly due to being the only model that used a Gaussian error structure in both settings.
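The two summary statistics defined above can be computed directly from sets of variant identifiers. The following is a minimal sketch in plain Python; the function names and the toy variant labels are hypothetical and are not part of pyseer.

```python
def confusion_counts(selected, causal, all_variants):
    """Count TP, FP, FN, TN for one simulated data set and one method."""
    selected, causal = set(selected), set(causal)
    tp = len(selected & causal)   # causal variants that were output
    fp = len(selected - causal)   # output variants that are not causal
    fn = len(causal - selected)   # causal variants that were missed
    tn = len(set(all_variants)) - tp - fp - fn
    return tp, fp, fn, tn


def power(tp, fn):
    """Proportion of true causal variants recovered: TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_proportion(tp, fp, fn, tn):
    """Noncausal variants in the output over all tested variants."""
    return fp / (tp + fp + fn + tn)


# Toy example (hypothetical identifiers): 10 tested variants, 4 causal.
variants = [f"v{i}" for i in range(10)]
causal = ["v0", "v1", "v2", "v3"]
selected = ["v0", "v1", "v2", "v7"]

tp, fp, fn, tn = confusion_counts(selected, causal, variants)
print(power(tp, fn), false_positive_proportion(tp, fp, fn, tn))
```

Note that the false-positive measure used here divides by all tested variants rather than by the number of noncausal variants, following the definition in the text.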

FIG 2

Power and false-positive rates in the simulation study set up to resemble antibiotic resistance genotype-phenotype architecture. (Top) The effect of sample size, with 100 causal variants in the pbp2x gene and a binary phenotype. (Bottom) The effect of phenotype heritability, with 50 causal variants spread across the three pbp genes and a continuous phenotype. Multivariate methods tested were the elastic net with default α (red) and Lasso regression (orange). Univariate methods were the fixed effects/seer model (blue) and the FastLMM linear mixed model (green).

Power and false-positive rates of GWAS and elastic net regression methods. (a) Different numbers of true variants were chosen from the pbpX gene, sample size was varied. (b) Different numbers of true variants were chosen from across the genome, sample size was varied. (c) Fifty true variants were chosen from different setups: LD-pruned variants across the genome, and the genes pbpX, pbp1a and penA. Sample size was varied. (d) Fifty true variants were chosen from different setups: LD-pruned variants across the genome, pbpX, pbp1a, and penA. Sample size was varied. (e) Varied heritability of a continuous phenotype. Fifty true variants were chosen from one or two genes. Sample size was fixed at 3,000. (f) Varied heritability of a binary phenotype. Fifty true variants were chosen from one, two, or three genes. Sample size was fixed at 3,000. Download FIG S2, PDF file, 1.0 MB.

However, in exchange for reduced power, the linear mixed model consistently showed the best control of false positives in all settings, always <5%. In contrast, the fixed-effect model had a much greater false-positive rate than any other method, which grew both with sample size and heritability. The elastic net’s false-positive rate was typically <5% and was robust across the ranges of heritability tested, though it increased slightly with larger sample sizes as more variants were included in the fitted model.
With such a large number of variants, even a small false-positive rate can be problematic, and so combining selected variants with a ranking by P value from a GWAS is important. It is possible to do this with least angle regression (49) in lasso regression, but due to the large number of variants, a P value from GWAS is most convenient. We also tested lasso regression on the same footing, which is the other extreme of the α setting in the elastic net. For smaller numbers of causal variants, performance was similar to that of the linear mixed model both in terms of power and false-positive rate, in some cases having slightly higher power. However, when the number of causal variants was higher, the amount of sparsity introduced was too high, reducing power below that of other methods (though false-positive rate was low in all settings for the same reason). As the number of causal variants is generally not known a priori, we would therefore always recommend the elastic net with a small α over the lasso. These results show that the variants selected by the elastic net are causal at similar rates to those with GWAS methods. The elastic net’s selected variants can be used as an effective trade-off between the regimes of the two commonly used GWAS models, having higher power than the linear mixed model and a lower false-positive rate than fixed-effect models. As many bacterial GWAS results must be followed up with lab work, these results suggest a dual approach of variable selection with the elastic net followed by ranking results with the linear mixed model may be useful when the first set of variables selected is large. This is possible in a single step in pyseer.
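The dual approach described above (select with the elastic net, then rank with LMM P values) can be illustrated with a short sketch. This is not the pyseer implementation; the function name, the dictionaries of coefficients and P values, and the unitig labels are all hypothetical, and the Bonferroni threshold simply divides 0.05 by the number of variants tested.

```python
def rank_selected_variants(coefs, lmm_pvalues, n_tested, alpha_level=0.05):
    """Keep variants with nonzero elastic net coefficients and rank by LMM P value.

    coefs        -- dict: variant -> elastic net coefficient (assumed input)
    lmm_pvalues  -- dict: variant -> P value from an LMM (assumed input)
    n_tested     -- number of variants tested, used for Bonferroni correction
    """
    threshold = alpha_level / n_tested  # Bonferroni-adjusted threshold
    selected = [v for v, beta in coefs.items() if beta != 0.0]
    ranked = sorted(selected, key=lambda v: lmm_pvalues[v])
    # Each entry: (variant, P value, passes the adjusted threshold?)
    return [(v, lmm_pvalues[v], lmm_pvalues[v] < threshold) for v in ranked]


# Toy example with four hypothetical unitigs.
coefs = {"u1": 0.8, "u2": 0.0, "u3": -0.1, "u4": 0.02}
pvals = {"u1": 1e-12, "u2": 0.3, "u3": 0.2, "u4": 1e-4}

ranking = rank_selected_variants(coefs, pvals, n_tested=len(coefs))
for variant, p, significant in ranking:
    print(variant, p, significant)
```

Ranking rather than hard filtering keeps weakly supported variants visible for follow-up while putting the best-supported candidates first.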

Whole-genome model of bacterial phenotypes enables heritability estimation and combination with association mapping.

(i) Variant selection for pneumococcal antimicrobial resistances. Next, we tested whether our method selects causal variants for some phenotypes where these are known and compared our results to those from GWAS approaches. First, we analyzed a well-studied phenotype and data set: sensitivity/resistance to β-lactams in the SPARC cohort of 603 S. pneumoniae genomes. Resistance is mainly conferred by allelic variation of three genes (pbp1a, pbp2b, and pbp2x), which are easily detected by most GWAS and machine learning methods with SNP calls as input (47, 50). Though the regions are always correctly identified, the specific variants detected are not identical between methods (51). Figure 3 shows the results of this analysis. With both the elastic net and linear mixed model plus cutoff, penA-pbp2b and pbpX-pbp2x are clearly the strongest hits. pbp1a is also selected by both methods, and while it can be seen on both Manhattan plots, it is slightly clearer in the gene summary plot for the elastic net, due to the larger number of SNPs selected in the gene. Taking only hits which pass a P value threshold gives a very clean result for the linear mixed model (LMM) with this strongly selected phenotype; only a few noncausal genes are included, usually with only a few SNPs and much lower ranking than the causal genes. The elastic net selects many more noncausal SNPs across the MAF spectrum, though combining with P values and number of hits allows these to be effectively filtered. It should be noted that effect size does not appear to be an effective filtering criterion without taking into account minor allele frequency, which may have implications for other machine learning methods where a P value cannot easily be integrated. Both methods calculate comparable heritability estimates for this trait: for the LMM, h2 = 0.89, and for the elastic net, h2 = 0.81.

FIG 3

Elastic net and linear mixed model with SNP-based penicillin resistance. (Top) Manhattan plot of the selected elastic net variants, with a Bonferroni-adjusted significance threshold in red. The three biggest peaks are in the causal pbp genes. (Middle) The same result with the LMM, taking only those SNPs above the significance threshold. (Bottom) Summary of the genes selected by both methods (left, elastic net; right, LMM), maximum P value, and average absolute effect size within each gene.

We then undertook a more challenging analysis, using unitigs to investigate antimicrobial resistances in the larger Maela cohort (3,162 S. pneumoniae genomes). We previously attempted GWAS on this data set using a fixed-effect GWAS model in the original description of our SEER software (12). We did not analyze tetracycline resistance or chloramphenicol resistance, as these are driven by single elements which were easily detected with the previous method. We instead used trimethoprim and erythromycin resistances. Trimethoprim resistance is expected to have two causal loci (folA-dyr and folP) due to being administered jointly with sulfamethoxazole as co-trimoxazole (52). Indeed, using our method on trimethoprim, the two causal loci are clearly identified and are the most highly ranked on a Manhattan plot (Fig. S3); applying sequence reweighting makes little difference to the result. Erythromycin resistance has multiple causal mechanisms (ermB, mel, and mef) which were not easily found in our previous attempt. Again, the erythromycin results contained many peaks in their Manhattan plots (see Fig. S4). When we mapped the unitigs directly to resistance genes, we found significant results in ermB (9 hits, minimum [min] P = 10^−47), mel (12 hits, min P = 5 × 10^−42), and mef (6 hits, min P = 5 × 10^−42). While this was clearly more successful than our previous analysis, when considering the noise when mapping to a single reference, these causal mechanisms would not stand out (see Fig. S5).
So, while our method reduced the computational burden of the analysis through the use of unitigs, it was not able to easily resolve the causal mechanisms in this challenging example. This suggests that both single variant tests and whole-genome models would struggle to arrive at true causal predictions under such circumstances. More flexible black box machine learning type approaches may help to improve prediction accuracy in these cases but were outside the scope of this study due to difficulty in interpreting the models in terms of causal variants.

Manhattan plots for trimethoprim resistance in S. pneumoniae showing locations of causal loci folP and dyr. Plotted variants are unitigs selected by the elastic net using sequence reweighting; P values are from the LMM. (a) Using sequence reweighting. (b) Without sequence reweighting. Download FIG S3, PDF file, 1.0 MB.

Manhattan plots for erythromycin resistance in S. pneumoniae. Plotted variants are unitigs selected by the elastic net using sequence reweighting; P values are from the LMM. (a) Using sequence reweighting. (b) Without sequence reweighting. Download FIG S4, PDF file, 1.5 MB.

Summary of genes with overlapping selected unitigs in the weighted GWAS. Each point is a gene, x axis is the average effect size (beta) of unitigs covering the locus, y axis is the minimum P value of any unitig in the locus; size relates to the total number of unitigs mapped to the gene, color is the average MAF of the mapped unitigs. (a) Using sequence reweighting. (b) Without sequence reweighting. Download FIG S5, PDF file, 0.7 MB.

(ii) Heritability and mapping of gonococcal resistance. We also applied our method to a combined cohort of 1,595 Neisseria gonorrhoeae genomes where resistance to five different antibiotics has been measured. These data were previously used to do GWAS using an LMM, with selected loci then entering a reduced dimension epistasis analysis (53).
The mapping of resistance genes for these antibiotics using our approach was similar to this GWAS. The original analysis looked at ∼8,700 SNPs with a MAF of >0.5%; we used 5.3 × 10^5 unitigs with a MAF of >1%. Azithromycin (AZI) had 4,612 unitigs selected, with the top hits mapping to the four 23S rRNA sequences in the genome. The original analysis only identified a SNP in one of these repeated rRNA sequences, likely due to the impossibility of mapping variation in these repeats at a single base level: this is an advantage of being able to report multiple mappings of sequences at the final stage. Cefixime (CFX) identified the penA region, as in the original analysis, and also suggested an association in the promoter of opaD. Ciprofloxacin (CIPRO) had hits throughout the genome, as in the original analysis and similar to the analysis of erythromycin described above; combining the LMM with the elastic net may reduce candidate regions in these cases. Penicillin (PEN) had a hit in the porB region, as in the original analysis, along with hits in lgtE, mexB (an efflux pump), and a prophage. Tetracycline (TET) similarly had a replicated hit in the porB region, along with the cysN promoter, an alternative pilE allele, and rsmE (a ribosomal methyltransferase). These Manhattan plots can be seen in Fig. S6. The WHO N strain (54) contains plasmids with blaTEM and tetM, causal for PEN and TET, respectively, to which we can also map unitigs rather than needing to recall variation with respect to this reference panel. This confirms further hits to these genes. Our method therefore broadly replicated the results from the LMM and added new candidate hits due to testing unitigs rather than just SNPs, as expected.

Manhattan plots for antimicrobial resistances in N. gonorrhoeae. Plotted variants are unitigs selected by the elastic net without sequence reweighting; P values are from the LMM. (a) Azithromycin. (b) Cefixime. (c) Ciprofloxacin. (d) Penicillin. (e) Tetracycline.
Download FIG S6, PDF file, 2.6 MB.

We also calculated the narrow sense heritability h2 using our elastic net and unitig method and compared these to those calculated with previous methods (Fig. 4). Our estimates were very similar to those from the original paper, though consistently slightly higher (3% ± 7%), which may be a result of including more of the population variation through unitigs. Using a simple estimate of shared sequence content as the kinship matrix led to a likely overestimate of heritability. For these antibiotics, we expect high h2, approaching 1, as we discover further causal mechanisms and include them in the model (55). These estimates are consistent with this expectation but are difficult to evaluate quantitatively. It is challenging to evaluate the accuracy of heritability estimates, as the true biological value cannot easily be measured in bacteria and measurement via simulation is often circular, where methods used to generate the simulations necessarily perform best.

FIG 4

Heritability estimates for antibiotic resistance in the combined N. gonorrhoeae data set, using different methods. For each of the five antibiotic resistances measured in this data set, we report the narrow-sense heritability (h2) from our elastic net method and unitigs (gray), the limix method implemented in pyseer, using sequence distance (gold) or phylogeny (blue), and the restricted maximum likelihood (REML) approach used in the original publication (green). For limix estimates, 95% confidence intervals (CIs) were calculated with FIESTA (82). These are not shown for the phylogeny method as they span a range wider than the plot (0.11 to 1).

DISCUSSION

In this paper, we developed a microbe-specific implementation of a machine learning tool and showed how this can be used to better understand the link between bacterial genetic variation and phenotypic variation. We argue that the sophistication of the prediction model itself is generally less important than three other factors: the data set itself, creating a method with careful genomic data management, and incorporating knowledge specific to bacterial populations. We addressed these issues as follows. Pangenomic variation was covered using a unitig definition of population variation, which we showed to be scalable, unlike k-mers, and better suited to analyzing accessory genome variation and inter-data set consistency, unlike SNPs. Population structure was accounted for explicitly using sequence reweighting and leave-one-strain-out cross-validation. We maintained a clear link between our resulting models and underlying genetics by combining linear models with a suite of tools to interpret the variants selected in the model. This had the further advantage that it significantly reduced the number of sequence elements to be processed after association. Using selected unitigs allowed for a much smoother use of the interactive plotting software phandango (56), which is one of the fastest ways to interpret bacterial GWAS results. Our method is interpretable, and selected variants contain a high proportion of causal variants in simulations. We compared power and false-positive rate to selection with a P value threshold using simulations, and using real data from two species showed how this method can be combined with GWAS approaches to understand resistance and epistasis. LMMs are the best GWAS approach but mostly excel in cases where a trait is polygenic and has evolved multiple times, as is the case with many causal antibiotic resistance markers (11). 
If causal variants are more closely correlated with lineage, as may be the case for less strongly selected phenotypes, the LMM has reduced power, whereas the elastic net includes these groups of variants more effectively than previous fixed-effect models. This has recently been shown in an independent set of simulations (57). It is also worth noting that in highly clonal settings, the LMM can suffer from a very high false-positive rate unless the kinship matrix is carefully chosen (57, 58). Depending on the genetic architecture of the phenotype, which is rarely known a priori, one of these two methods may have a more desirable power/false-positive trade-off. The option to combine selection with the elastic net and ranking from the LMM appears to be useful in some challenging intermediate cases. An alternative approach is to associate LMM variants with lineages themselves marginally associated with the phenotype, as proposed in the bugwas package (11). The elastic net would be preferable where individual lineage associations are weak but causal, and the entire lineage block would be rejected. We also obtain useful estimates of trait heritability, some of which show evidence that previous approaches may have overestimated this quantity. For the purposes of prediction, on a simple data set, we replicated the result that regularized linear models perform similarly to more complex deep learning methods. We also applied our method to a range of data sets from different species and phenotypes, including resistance, carriage duration, and virulence. Though our models generally performed well when measured on error rates, an experiment with models on three separate cohorts showed how accuracy falls outside the target data set. External data sets may have different strain compositions due to different biases toward more- or less-virulent strains, geographical separation, vaccine use, or antibiotic consumption in the population. 
We would reiterate the caveat that while these models can be useful, high accuracy on test data should not be taken as a general measure of confidence (9). Batch differences such as genotyping methods between cohorts exaggerate this problem, and so a consistent approach (such as the one we provide here) should be used. Unsurprisingly, curated resistance sets—the result of decades of research—still generally perform better, although even this in silico method loses accuracy between data sets (34). Less-well-understood and potentially polygenic phenotypes such as virulence offer an attractive target for our model, as we demonstrated on two streptococcal pathogens. Along with these theoretical advances, our package has a number of practical advantages. All of the elements of our method are freely available, well documented, and part of a continuous unit-testing framework. Users can construct and evaluate models easily, without the need for programming experience, with options which retain the flexibility to modify the model parameters. There is no need for specialist hardware such as the graphics cards needed to fit large deep learning models. The models themselves are saved in a human-readable format, are easy to share and reuse, and have minimal hardware-specific requirements. Our method does remain limited due to its reliance on the elastic net. Higher order interactions are difficult to include, as they make the size of the input space increase greatly, whereas in other machine learning approaches such as random forests, these are included naturally in the model structure. Simultaneous selection of both hyperparameters α and λ is more challenging due to a greatly increased search space in cross-validation, and so we rely on a heuristic selection of α. For larger data sets, the requirement to store the entire variant matrix in memory to fit the elastic net can easily exceed available memory. 
This could be solved by storing this matrix as a file on a disk and memory mapping this file, or reading rows only as needed (40). Our implementation cannot currently incorporate prior information on input variants, such as known association with antimicrobial resistances, though extension to either an ensemble model or Bayesian regression would be possible. As we have introduced a general framework for variant input/output in pyseer, we hope to include further machine learning approaches which allow trade-off of these advantages and disadvantages in the future. The effect sizes from the elastic net do not necessarily indicate the significance of individual variants, as they are optimized for prediction; even after reweighting sequences, nonzero effects form a much larger set than true causal variants and are spread across the genome rather than specifically mapping to a causal region. The combination with P values from a well-performing GWAS method, such as the linear mixed model, helps if the user requires this interpretation. We would expect the same to be true for other machine learning methods and would generally caution against making a “GWAS-type” inference based on predictor importance or similar measures. We also did not perform a thorough comparison with other available machine learning methods; there are many choices, with well-known characteristics, which have previously been shown to perform similarly well at this task (10). More broadly, we have considered techniques routinely used in the analysis of modern data sets, which are generated frequently with high-throughput methods. These can be adapted to perform fundamental tasks in bacterial genomics in ways which are useful and that scale with our ambitions to discover causal drivers and predict phenotypes from genome variation. Collections of high-quality whole-genome sequences are now available at a scale that would have been unfathomable just a few years ago. 
Many of these data sets are publicly available already, and many more are being generated from new larger projects and routine surveillance by public health agencies. Care must be taken to ensure the unique properties of bacterial populations are properly modeled and that we use appropriate measures of success. Complex models should be compared to simple models (59, 60), not just in terms of accuracy but also for their ability to look at underlying biology. In many cases, the limiting factor is unlikely to be model flexibility. With our pangenome-spanning penalized regression models, we hope to have made useful and usable contributions that respect these principles.

MATERIALS AND METHODS

Preparation of data sets.

Table 3 shows a summary of the data sets used in this paper. Sequence assemblies were available from the original publications, with the exception of one N. gonorrhoeae study (61). For this study, we downloaded the read data, removed adapter sequences with trimmomatic (62) v0.36, and assembled them with SPAdes v3.11 (63) using the --only-assembler and --careful options. For all data sets, we then called unitigs from each sample’s sequence assembly using a k-mer length of 31, and low-frequency unitigs (AF < 1%) were discarded. For the Massachusetts and tuberculosis (TB) data sets, additional genetic data were available. For the TB data set, we used the variant call matrix provided by the study’s authors (5). Not all phenotypes were available for all samples from this study. For the Massachusetts and Maela data sets, we used SNP calls from an earlier GWAS in this population (47). Where a split into training and test data was needed, this was performed at random in the ratio 2:1. When including a cluster assignment to account for population structure, we used the previous assignments from PopPUNK where available (64). For TB, we used the major lineage as the cluster. Where antibiotic susceptibility was reported as an MIC, this phenotype was first log transformed before any downstream analysis. Other phenotypes were used as originally reported. To generate simulated data used for testing power and false-discovery rate against a ground truth, we simulated phenotypes but used observed SNP genotype data from the Maela data set to ensure a realistic genetic model for bacterial population structure. Phenotypes were simulated using GCTA (16), either as continuous or as binary using a liability threshold model. Then, to assess the power of the methods with respect to the sample sizes, we randomly chose subsamples with 500, 1,000, 2,000, and 3,000 samples (with h2 fixed at 0.5). Separately, using 3,000 samples, h2 was varied at 0.1, 0.3, 0.6, and 0.9.
We did not use fewer than 500 samples, as we expect a significant decrease in discovery power (12). LMM and fixed-effects models were run using default settings in pyseer (10 multidimensional scaling [MDS] components and kinship matrix estimated from shared distance to the root of a phylogeny). We used a fixed effect size of 4, a relatively high effect size typical of that found for antimicrobial resistance SNPs, which have been found to range from roughly 2.5 to 7 for this data set (45), and higher than virulence effects, which have been found to range between 0.1 and 1 in the Netherlands data set described above (43).
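To make the simulation design concrete, the following sketch simulates a continuous phenotype with a chosen heritability from a fixed genotype matrix, and then a binary phenotype via a liability threshold. It is an illustration only and does not reproduce GCTA: the function names, the prevalence parameter, the random toy genotypes, and the choice to rescale the noise so that the sample heritability matches the target exactly are all assumptions made here.

```python
import random
import statistics


def simulate_phenotype(genotypes, causal_idx, effect_size, h2, seed=0):
    """Additive phenotype y = g + e with var(g) / (var(g) + var(e)) = h2."""
    rng = random.Random(seed)
    # Genetic value: the same fixed effect size for every causal variant.
    g = [effect_size * sum(row[j] for j in causal_idx) for row in genotypes]
    var_g = statistics.pvariance(g)
    noise = [rng.gauss(0.0, 1.0) for _ in g]
    # Centre and rescale the noise so its variance is var_g * (1 - h2) / h2.
    target_sd = (var_g * (1.0 - h2) / h2) ** 0.5
    scale = target_sd / statistics.pstdev(noise)
    mean_noise = statistics.mean(noise)
    e = [(x - mean_noise) * scale for x in noise]
    y = [gi + ei for gi, ei in zip(g, e)]
    return y, g, e


def liability_to_binary(liability, prevalence):
    """Liability threshold model: the top `prevalence` fraction become cases."""
    n_cases = round(prevalence * len(liability))
    if n_cases < 1:
        raise ValueError("prevalence too low for this sample size")
    cutoff = sorted(liability, reverse=True)[n_cases - 1]
    return [1 if value >= cutoff else 0 for value in liability]


# Toy data: 200 samples, 20 binary variants, 3 causal (all hypothetical).
rng = random.Random(1)
genotypes = [[rng.randint(0, 1) for _ in range(20)] for _ in range(200)]
y, g, e = simulate_phenotype(genotypes, causal_idx=[0, 5, 9],
                             effect_size=4.0, h2=0.5)
binary = liability_to_binary(y, prevalence=0.25)
```

In the study itself the genotypes were observed Maela SNPs rather than random draws, which is what preserves realistic population structure in the simulation.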

Elastic net model.

We use a high-dimensional regression model which includes all genetic variants and covariates of interest X. Typically, this is not possible using classical inference methods such as the least-squares estimator, as the number of genetic predictors m exceeds the number of samples N, leading to an underconstrained system. A penalized objective function is therefore minimized instead. The elastic net defines such a function (33, 65), mixing L1 (lasso) and L2 (ridge) penalties, which is minimized with respect to the values of the intercept b0 and slopes b:

$$\min_{b_0, \mathbf{b}} \; \frac{1}{N} \sum_{i=1}^{N} w_i \, l\left(y_i, b_0 + \mathbf{b}^{T} \mathbf{x}_i\right) + \lambda \left[ \frac{1 - \alpha}{2} \lVert \mathbf{b} \rVert_2^2 + \alpha \lVert \mathbf{b} \rVert_1 \right]$$

where N is the number of samples, x_i is the vector of predictors in X for sample i, wi are positive weights (with a sum equal to N), l() is the link function (linear or logit for continuous or binary phenotypes yi, respectively), λ > 0 is the magnitude of the penalty, and 0 < α < 1 is the amount of mixing between L1 and L2 penalties. Minimizing this function reduces the squared distance between predicted and observed values when λ = 0; as λ increases, predictors are shrunk toward zero to trade off prediction accuracy against overfitting. Given the strong linkage disequilibrium present in bacterial populations (11, 66), many genetic variants are strongly correlated across long distances. Randomly selecting a single representative variant from such a group will likely lead to greatly varied biological conclusions, and so including the entire block with suitable lineage annotation is preferred (11). For this reason, the elastic net has been shown to be especially useful when the variables are dependent (33). We compare this selection with the lasso in our simulations. The value of α can be changed by the user, should they wish to opt for a sparser model. Two important parameters which are not directly set using the data are λ and α. The amount of penalty λ is set by cross-validation, using a default of 10-fold to pick the value of λ, which maximizes the cross-validated R2 value of the model over the test data. The user can change the number of folds.
The amount of mixing (α) between L1 penalties, which lead to sparse predictive models with mostly zero valued slopes, and L2 penalties, which lead to models with shrunken but nonzero predictors, could also be chosen by cross-validation to maximize prediction accuracy. However, we propose users choose a value of around 0.01 throughout (all experiments reported here use α = 0.01). A sparse model is preferred for prediction for speed and easier consistency between populations, but this leads to a loss of power in the context of GWAS, as potentially causally related predictors may be removed from the model (Fig. 2; see also Fig. S2 in the supplemental material). Introducing even a small L1 penalty in this way removes a large fraction of variants unconnected with the phenotype. By virtue of the fact that all pangenomic variation enters this model, if a linear additive model of heritability is assumed, which it typically is in bacterial GWAS (44, 45, 67), the value of R2 calculated from the fit of the elastic net,

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}$$

(with ŷi the fitted value for sample i and ȳ the mean phenotype), also serves as an estimate of the narrow-sense heritability h2. As R2 measures the variance explained by the model’s predictors, in this case all genetic features, this is equivalent to the proportion of phenotypic variance explained by genetic variation, the definition of h2. This provides a lower bound on h2, because the Lasso-type estimator is biased (68), and it tends to shrink some coefficients with weak effects toward zero, though these weak effects may have a significant effect on the trait variability.
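As an illustration of how the penalty acts, the sketch below fits an elastic net for a continuous phenotype by cyclic coordinate descent and reports R2. It is not the pyseer code path: it assumes unit sample weights, a glmnet-style half mean squared error loss, fixed λ and α (no cross-validation), and a tiny toy data set; all function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator that produces the L1 sparsity."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def predict(X, b0, b):
    return [b0 + sum(bj * xj for bj, xj in zip(b, row)) for row in X]


def fit_elastic_net(X, y, lam, alpha=0.01, n_iter=300):
    """Minimise (1/2N) * RSS + lam * [(1 - alpha)/2 * |b|_2^2 + alpha * |b|_1]."""
    n, m = len(X), len(X[0])
    b0, b = sum(y) / n, [0.0] * m
    for _ in range(n_iter):
        for j in range(m):
            # Correlation of predictor j with the residual that excludes j.
            rho = sum(
                X[i][j] * (y[i] - b0 - sum(X[i][k] * b[k]
                                           for k in range(m) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            b[j] = soft_threshold(rho, lam * alpha) / (z + lam * (1.0 - alpha))
        # The intercept is not penalised.
        fitted = predict(X, 0.0, b)
        b0 = sum(yi - fi for yi, fi in zip(y, fitted)) / n
    return b0, b


def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot, used in the text as an estimate of h2."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Toy data: the phenotype depends only on the first of two variants.
X = [[0, 1], [1, 0], [0, 0], [1, 1], [0, 1], [1, 0]]
y = [0.0, 2.0, 0.0, 2.0, 0.0, 2.0]

b0_ols, b_ols = fit_elastic_net(X, y, lam=0.0)                     # no penalty
b0_small, b_small = fit_elastic_net(X, y, lam=0.5, alpha=0.01)     # mostly ridge
b0_lasso, b_lasso = fit_elastic_net(X, y, lam=10.0, alpha=1.0)     # strong lasso
print(r_squared(y, predict(X, b0_small, b_small)))
```

With the penalty switched on, the fitted slope is shrunk toward zero and R2 falls below the unpenalized value, which is the same mechanism that makes the elastic net R2 a lower bound on h2.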

Efficiently modeling the entire pangenome.

Bacterial populations vary greatly in their sequence content, and mapping short variation within their core genes (coding sequences shared by all members of the population) is generally insufficient to capture all of the variation within the samples. In particular, accessory gene content has been shown to be associated with core variation (64), be associated with clinically relevant phenotypes (12, 69), and be useful for predicting the evolution of the population (70, 71). Early bacterial GWAS methods used k-mers, sequence words of fixed or variable length, to assay variation throughout the population independent of gene annotation or variant calling method (11, 12, 35). The set of common k-mers (1% to 99% frequency) is vast, particularly in the large and genetically diverse populations which are most amenable to GWAS. Efforts to model all of these sequence elements simultaneously are potentially computationally intractable, as these words will not fit in main memory and model fitting takes an extremely long time. In the S. pneumoniae data sets, we observed a 13- to 18-fold reduction in the number of unitigs compared to that of k-mers. We used two techniques to circumvent this issue while still including as much pangenomic variation as possible. Following the idea of screening methods in ultrahigh-dimensional data (72, 73), we used the absolute value of the sample correlation as a screening criterion for each variant:

$$\left| \widehat{\mathrm{cor}}(\mathbf{y}, \mathbf{x}) \right| = \left| \frac{1}{N} \sum_{i=1}^{N} \tilde{y}_i \tilde{x}_i \right|$$

where $\tilde{\mathbf{y}}$ and $\tilde{\mathbf{x}}$ are the standardizations of y and x (defined as phenotype and variant as above) such that

$$\sum_{i=1}^{N} \tilde{y}_i = \sum_{i=1}^{N} \tilde{x}_i = 0, \qquad \frac{1}{N} \sum_{i=1}^{N} \tilde{y}_i^2 = \frac{1}{N} \sum_{i=1}^{N} \tilde{x}_i^2 = 1.$$

This is the “correlation filter.” Using a single threshold on the mean value for this correlation would lead to a large number of variants being removed before modeling, which is appealing computationally but, in our simulations, led to a loss of power. We instead removed the lowest quartile, which maintained power but did not reduce model size as much. The size of the quantile to remove can be set by the user.
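The correlation filter can be sketched in a few lines of plain Python. This is an illustrative version only: the function names, the handling of constant variants (scored as zero), and the way the quantile is converted to a count are assumptions, and pyseer's own implementation may differ in these details.

```python
import math


def abs_sample_correlation(x, y):
    """Absolute sample correlation between one variant and the phenotype."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / n)
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / n)
    if sd_x == 0.0 or sd_y == 0.0:
        return 0.0  # a constant variant carries no signal (assumed convention)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return abs(cov / (sd_x * sd_y))


def correlation_filter(variants, y, quantile=0.25):
    """Drop the `quantile` fraction of variants with the lowest correlation."""
    scores = {name: abs_sample_correlation(x, y) for name, x in variants.items()}
    ordered = sorted(scores, key=scores.get)       # lowest correlation first
    removed = set(ordered[: int(quantile * len(ordered))])
    kept = [name for name in variants if name not in removed]
    return kept, scores


# Toy example: four hypothetical unitig presence/absence vectors.
phenotype = [0, 0, 1, 1, 1, 0]
unitigs = {
    "u1": [0, 0, 1, 1, 1, 0],   # perfectly correlated
    "u2": [1, 1, 0, 0, 0, 1],   # perfectly anticorrelated
    "u3": [1, 0, 1, 0, 1, 0],   # weakly correlated
    "u4": [1, 1, 1, 1, 1, 1],   # constant
}
kept, scores = correlation_filter(unitigs, phenotype, quantile=0.25)
print(kept)
```

Filtering on a quantile of the scores, rather than on their mean, keeps the fraction removed fixed regardless of how skewed the correlation distribution is.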
We also followed the method used in DBGWAS (36), which after counting fixed-length k-mers, constructs a compressed de Bruijn graph (DBG) of the population. Nodes in this compressed graph are extensions of adjacent k-mers in the raw graph with the same population frequency vector and whose sequences are referred to as unitigs. These unitigs greatly reduce the redundancy present in raw k-mer counts by combining those with the same patterns and are generally easier to functionally interpret due to their longer length. We followed the same method as step 1 of the DBGWAS package, which uses the GATB library to construct a compressed DBG (74), and then reported frequency vectors of each unitig/node and unique pattern in a format readable by pyseer. We used a k-mer length of 31 throughout to count unitigs, as this was previously shown to maximize association power (36). This length can also be set by the user. We reimplemented this approach as a standalone package (https://github.com/johnlees/unitig-counter), also including tools to extend unitigs by traversing neighboring nodes in the graph, and calculate distances between unitigs based on the graph using Dijkstra’s algorithm.
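As a rough illustration of graph-based distances between unitigs, the sketch below runs Dijkstra's algorithm on a small hand-made graph. The node names and edge weights are hypothetical, and the choice of edge weight (here, a number of bases) is an assumption; it does not reproduce how unitig-counter defines distances.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distance from source to every reachable node.

    graph: dict of node -> dict of neighbour -> non-negative edge weight
    """
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return dist

# Hypothetical undirected unitig graph; weights are illustrative only.
edges = [("u1", "u2", 40), ("u2", "u3", 35), ("u1", "u3", 90), ("u3", "u4", 50)]
graph = {}
for a, b, w in edges:
    graph.setdefault(a, {})[b] = w
    graph.setdefault(b, {})[a] = w

distances = dijkstra(graph, "u1")
print(distances)
```

In this toy graph the path u1, u2, u3 (total 75) is shorter than the direct u1 to u3 edge (90), so the distance to u4 is 125.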

Incorporating population structure.

Population structure causes correlations between the genetic variants (unitigs) that make up the design matrix X. As all of these enter our model together, this effect may be implicitly controlled for without the need for further correction. We also wished to compare this to the use of an explicit correction term to test which approach is more effective. This can be included in the modeling step by a combination of three approaches: adding extra predictors to X which account for population structure, modifying the per-sample weights $w_i$, or changing the folds used in cross-validation. We discuss each of these in turn. Fixed-effect models typically use a principal components-type analysis to include the main axes of variation in the population as covariates. In a new data set, projection of variation onto these existing axes could be used but would require large overlap between variant calls in each data set to be accurate. Random-effect models use a kinship matrix to include the sample covariance matrix in each association. For a new data set, this would require calculation of covariance against the original data set, which reduces model portability. We therefore opted to use a definition of population structure which does not introduce extra predictors. This makes application of the model more straightforward in new data sets. The use of a definition of clusters which naturally extends to new populations, which may have very different strain frequencies and/or large numbers of novel clusters, further increases robustness in the face of between-data set variation. Any method which produces discrete cluster membership definitions independent of cluster frequency is suitable for this purpose, such as sequence type, clonal complex, or percentage identity cutoff.
We opted to use the “strain” definition provided by the PopPUNK software throughout our analysis due to its speed and biological basis (64). However, our implementation allows any preferred definition of cluster membership to be used. Cluster membership for each sample x is defined as $C_i(x)$, which is 1 if x is in cluster i or 0 otherwise. The model is fit to all the data except one cluster, and its accuracy is then tested on the cluster not used in the fitting. This is repeated for every cluster to find the value of λ which maximizes fit accuracy over all these fits. This may be referred to as leave-one-strain-out (LOSO) cross-validation or leave-one-cluster-out (LOCO) cross-validation. This is more realistic than random selection of folds, as random samples would maintain relative strain frequencies between training and test data, whereas new populations usually vary greatly in their genetic background (1, 10). Furthermore, we added the sequence-reweighting option in pyseer, which defines the sample weights as being inversely proportional to the cluster/strain size: $w_j \propto 1 / \sum_k C_i(x_k)$ for a sample $x_j$ belonging to cluster i, where the sum runs over all samples and so counts the members of cluster i. This sequence reweighting is a commonly used definition in epistasis methods such as direct coupling analysis and correlation-based approaches, which have recently been successfully applied to genome-wide variation in bacterial populations (38, 39, 75).
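A minimal sketch of leave-one-strain-out folds and of weights inversely proportional to cluster size follows. The function names and the cluster labels are hypothetical, and normalizing the weights to sum to 1 is an assumption made here for readability rather than something stated in the text.

```python
from collections import Counter

def loso_folds(clusters):
    """Yield (held_out_cluster, train_indices, test_indices), one fold per cluster."""
    for held_out in sorted(set(clusters)):
        train = [i for i, c in enumerate(clusters) if c != held_out]
        test = [i for i, c in enumerate(clusters) if c == held_out]
        yield held_out, train, test

def sequence_weights(clusters):
    """Per-sample weights proportional to 1 / (size of the sample's cluster)."""
    sizes = Counter(clusters)
    raw = [1.0 / sizes[c] for c in clusters]
    total = sum(raw)
    return [w / total for w in raw]  # normalization is an assumption

# Hypothetical strain assignments for six samples.
clusters = ["s1", "s1", "s1", "s2", "s2", "s3"]

folds = list(loso_folds(clusters))
weights = sequence_weights(clusters)

for held_out, train, test in folds:
    print(held_out, "train:", train, "test:", test)
print([round(w, 3) for w in weights])
```

With these weights every cluster contributes the same total weight regardless of how many samples it contains, so a heavily sampled strain does not dominate the fit.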

Phenotypic prediction while maximizing consistency between data sets.

Prediction of unobserved phenotypes $y_i$ is achieved by forming a linear predictor from the nonzero slopes in $\hat{\beta}$: $\hat{\eta}_i = \hat{\beta}_0 + \sum_{j:\hat{\beta}_j \neq 0} x_{ij}\hat{\beta}_j$, to which the appropriate link function l() is then applied to convert it into a probability (which can be converted into a binary outcome using a threshold cutoff). Predictors which are missing can either be ignored (set x = 0) or imputed (set $x = \bar{x}$, where $\bar{x}$ is the allele frequency in the original data set). For the unitig caller approach described below, a missing call means a genuine absence in the data, and so we used the former approach. For variant calling methods where missing calls may be artifactual (such as SNP calls), we use the mean value imputation approach. The option --ignore-missing can be used to control this behavior manually. To apply fitted models to new populations, consistency in the construction of the matrix of genomic variants between data sets is important to maximize prediction accuracy. This can be highly challenging for core SNPs due to the many possible methods and individual options to extract these from sequence reads. Sequence element presence or absence is simple to define, and so it is easier to enforce consistency of input between data sets. The main issue with using unitigs here is that DBGs of different sample sets will have different sequences at their nodes and, therefore, nonoverlapping unitig calls. To solve this issue, we defined unitig sequences once in the training population. Rather than building a new DBG in test data sets, we instead checked for the presence of unitigs previously defined in the training population. To do this efficiently, we created and saved a Ferragina and Manzini (FM) index for each input sequence (76), which is a substring index capable of supporting fast text searches, which we performed in parallel for each unitig query. We implemented this as a separate package, unitig-caller (https://github.com/johnlees/unitig-caller), in C++ using the SeqAn library (77).
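The prediction step, including the two ways of handling missing predictors, can be sketched as follows. The model layout mirrors the description of variant names mapped to allele frequencies and slopes, but the function name, the intercept argument, the toy values, and the use of the inverse logit for binary phenotypes are assumptions for illustration, not the pyseer code.

```python
import math

def predict(model, calls, intercept=0.0, ignore_missing=True, binary=True):
    """Apply a fitted sparse linear model to one new sample.

    model: dict of variant name -> (allele_frequency, slope), nonzero slopes only
    calls: dict of variant name -> 0/1 call for variants observed in the sample
    ignore_missing: True treats an uncalled variant as absent (x = 0);
                    False imputes x with the training allele frequency
    binary: True returns a probability via the inverse logit (an assumption);
            False returns the linear predictor itself
    """
    eta = intercept
    for name, (allele_freq, slope) in model.items():
        if name in calls:
            x = calls[name]
        elif ignore_missing:
            x = 0.0
        else:
            x = allele_freq
        eta += slope * x
    if binary:
        return 1.0 / (1.0 + math.exp(-eta))
    return eta

# Hypothetical fitted model with two selected variants.
model = {"unitig_a": (0.2, 2.0), "snp_b": (0.5, -1.0)}
calls = {"unitig_a": 1}  # snp_b was not called in this sample

p_ignore = predict(model, calls, intercept=-1.0, ignore_missing=True)
p_impute = predict(model, calls, intercept=-1.0, ignore_missing=False)
print(round(p_ignore, 3), round(p_impute, 3))
```

The two settings give different probabilities for the same sample, which is why the choice should follow from whether a missing call reflects true absence (unitigs) or a possible calling artifact (SNPs).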
An alternative implementation of both “counting” and “calling” unitigs is possible by constructing and querying sequences using a single population DBG. We implemented this in the same package by creating an interface to the Bifrost method (78). For continuous phenotypes, we report $R^2$ to describe prediction accuracy. This is more difficult to interpret with binary phenotypes, especially in the presence of class imbalance and when the imbalance deviates from the population-wide prevalence. For binary phenotypes, we found that positive statistics such as sensitivity and specificity, or especially area under the curve (AUC), fell in the top decile for almost all data sets and methods, which made them hard to compare intuitively between different models and data sets. We therefore report the false-negative rate and false-positive rate along with the totals selected.
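A short sketch of the false-negative and false-positive rates for a binary phenotype is given below. The function name and the toy labels are hypothetical, and returning the class totals alongside the rates is one reading of "totals selected", not a confirmed detail of the output format.

```python
def error_rates(truth, predicted):
    """False-negative and false-positive rates for 0/1 labels, with class totals."""
    pairs = list(zip(truth, predicted))
    positives = sum(1 for t, _ in pairs if t == 1)
    negatives = len(pairs) - positives
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    return {
        "fnr": false_neg / positives if positives else float("nan"),
        "fpr": false_pos / negatives if negatives else float("nan"),
        "positives": positives,
        "negatives": negatives,
    }

# Imbalanced toy example: 5 resistant (1) and 95 susceptible (0) samples,
# with a model that predicts every sample as susceptible.
truth = [1] * 5 + [0] * 95
predicted = [0] * 100

rates = error_rates(truth, predicted)
accuracy = sum(1 for t, p in zip(truth, predicted) if t == p) / len(truth)
print(rates, accuracy)
```

Here overall accuracy is 0.95 even though every resistant sample is missed, whereas the false-negative rate of 1.0 makes the failure obvious.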

Implementation.

We implemented the association model in version 1.3.2 of the pyseer bacterial GWAS package, which is written in Python (79). This takes care of reading variants in many common formats, including the output from unitig-counter, as well as providing tools to help interpret associated sequences. We used Python bindings to the Fortran glmnet package to fit the model, as its use of warm starts efficiently solves the above equation at an array of values of λ (65). Cross-validation, parallelized if requested by the user, is used to select the value of λ with the greatest $R^2$ value, as defined above. Variant matrices are potentially very large, and so to optimize speed and memory use, we read these into a sparse matrix structure. Variants with allele frequency of >50% have their genotypes flipped to increase sparsity; these sites are flipped back during prediction. This sparse structure can be saved to disk to avoid repeated parsing of variant input files. Only haploid variant calls (0/1), allele frequency, and sample order are saved in this file. After extracting the nonzero coefficients from the fitted elastic net, the input variant file is reread with minimal parsing to output information about the selected variants. A SHA256 hash of the input file is calculated to ensure consistency with the original input file. The fitted models are saved as an associative array, with variant names as keys (either sequence or alleles combined with chromosome and position) and allele frequencies and fitted slopes as values. New variant call files are read with minimal parsing to extract just those sites which appear in the model, and at the end of the file, the appropriate imputation procedure is applied to model terms which were not found. This allows both rapid prediction in large new data sets and an easy and portable way to share predictive models.
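The genotype flipping and the input-consistency check can be sketched with the standard library alone. This is an illustration under stated assumptions: the record layout and function names are hypothetical, a plain list of indices stands in for a real sparse matrix, and the hash is taken over an in-memory byte string rather than a file on disk.

```python
import hashlib

def encode_variant(calls):
    """Store a 0/1 call vector sparsely, flipping it if the allele is common.

    Only the positions of the rarer state are kept, so a variant present in
    more than 50% of samples is stored as the positions where it is absent.
    """
    n = len(calls)
    allele_freq = sum(calls) / n
    flipped = allele_freq > 0.5
    target = 0 if flipped else 1
    indices = [i for i, c in enumerate(calls) if c == target]
    return {"n": n, "af": allele_freq, "flipped": flipped, "indices": indices}

def decode_variant(record):
    """Recover the original 0/1 call vector, undoing any flip."""
    fill, mark = (1, 0) if record["flipped"] else (0, 1)
    calls = [fill] * record["n"]
    for i in record["indices"]:
        calls[i] = mark
    return calls

def sha256_of_bytes(data):
    """Hex digest used to check that an input matches the one seen earlier."""
    return hashlib.sha256(data).hexdigest()

common = [1, 1, 1, 0, 1, 1, 1, 1]  # allele frequency 0.875, stored flipped
rare = [0, 0, 1, 0, 0, 0, 0, 0]    # allele frequency 0.125, stored as is

common_record = encode_variant(common)
rare_record = encode_variant(rare)
print(common_record, rare_record)

original_digest = sha256_of_bytes(b"unitig_a\t1 1 1 0\n")
changed_digest = sha256_of_bytes(b"unitig_a\t1 1 0 0\n")
print(original_digest == changed_digest)
```

Both toy variants are stored with a single index, which shows why flipping common variants keeps the structure sparse.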
For fitted models, slopes, P values (adjusted by any of pyseer’s other models), and allele frequencies are included in the output. Where the true phenotype is known, prediction accuracy is reported using $R^2$ and, if the phenotype is binary, a confusion matrix. If clusters were provided, accuracy statistics within each cluster are also included in the output. We used the tools within pyseer to interpret unitigs. Specifically, we used bwa-mem with the shortest possible seed to map unitig sequences to a single reference genome (Streptococcus pneumoniae ATCC 700669 [80]; Neisseria gonorrhoeae FA1090 [81] and WHO_N [54]). Bonferroni corrections were calculated by using the number of unique unitig patterns as the number of tests. The new code in pyseer includes automated tests and unit tests we wrote using test data distributed with the package. Documentation and a tutorial are available online (https://pyseer.readthedocs.io/en/master/predict.html).
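The Bonferroni correction based on unique patterns can be sketched as below. The function name, the toy unitigs, and the family-wise error rate of 0.05 are assumptions for illustration; the text does not state the significance level used.

```python
def bonferroni_threshold(variants, alpha=0.05):
    """P-value threshold using the number of unique presence/absence patterns.

    variants: dict of variant name -> list of 0/1 calls, one per sample
    alpha: family-wise error rate (0.05 is an assumed default)
    Returns (threshold, number_of_unique_patterns).
    """
    patterns = {tuple(calls) for calls in variants.values()}
    return alpha / len(patterns), len(patterns)

# Three hypothetical unitigs, two of which share the same pattern and so
# correspond to a single effective test.
variants = {
    "unitig_a": [1, 0, 1, 0],
    "unitig_b": [1, 0, 1, 0],
    "unitig_c": [0, 0, 1, 1],
}
threshold, n_patterns = bonferroni_threshold(variants)
print(threshold, n_patterns)
```

Counting patterns rather than unitigs avoids penalizing the analysis for variants that are perfectly correlated and therefore give identical test results.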

Availability of data and materials.

The pyseer package is available as source code at the GitHub repository (https://github.com/mgalardini/pyseer; Apache 2.0 license), documented on readthedocs (http://pyseer.readthedocs.io/), and available for install on bioconda (https://anaconda.org/bioconda/pyseer). The unitig-counter package is available at the GitHub repository (https://github.com/johnlees/unitig-counter; AGPL 3.0 license) and available for install on bioconda (https://anaconda.org/bioconda/unitig-counter). The unitig-caller package is available at the GitHub repository (https://github.com/johnlees/unitig-caller; Apache 2.0 license) and available for install on bioconda (https://anaconda.org/bioconda/unitig-caller).
