| Literature DB >> 32636251 |
John A Lees1, T Tien Mai2, Marco Galardini3, Nicole E Wheeler4, Samuel T Horsfield5, Julian Parkhill6, Jukka Corander2,4,7.
Abstract
Discovery of genetic variants underlying bacterial phenotypes and the prediction of phenotypes such as antibiotic resistance are fundamental tasks in bacterial genomics. Genome-wide association study (GWAS) methods have been applied to study these relations, but the plastic nature of bacterial genomes and the clonal structure of bacterial populations creates challenges. We introduce an alignment-free method which finds sets of loci associated with bacterial phenotypes, quantifies the total effect of genetics on the phenotype, and allows accurate phenotype prediction, all within a single computationally scalable joint modeling framework. Genetic variants covering the entire pangenome are compactly represented by extended DNA sequence words known as unitigs, and model fitting is achieved using elastic net penalization, an extension of standard multiple regression. Using an extensive set of state-of-the-art bacterial population genomic data sets, we demonstrate that our approach performs accurate phenotype prediction, comparable to popular machine learning methods, while retaining both interpretability and computational efficiency. Compared to those of previous approaches, which test each genotype-phenotype association separately for each variant and apply a significance threshold, the variants selected by our joint modeling approach overlap substantially.IMPORTANCE Being able to identify the genetic variants responsible for specific bacterial phenotypes has been the goal of bacterial genetics since its inception and is fundamental to our current level of understanding of bacteria. This identification has been based primarily on painstaking experimentation, but the availability of large data sets of whole genomes with associated phenotype metadata promises to revolutionize this approach, not least for important clinical phenotypes that are not amenable to laboratory analysis. These models of phenotype-genotype association can in the future be used for rapid prediction of clinically important phenotypes such as antibiotic resistance and virulence by rapid-turnaround or point-of-care tests. However, despite much effort being put into adapting genome-wide association study (GWAS) approaches to cope with bacterium-specific problems, such as strong population structure and horizontal gene exchange, current approaches are not yet optimal. We describe a method that advances methodology for both association and generation of portable prediction models.Entities:
Keywords: elastic net; pangenome; phenotype prediction
Mesh:
Year: 2020 PMID: 32636251 PMCID: PMC7343994 DOI: 10.1128/mBio.01344-20
Source DB: PubMed Journal: mBio Impact factor: 7.867
Predicting antibiotic resistance in the SPARC collection using different variant types
| Variant type | Phenotype | No. selected | FPR (%) | FNR (%) | CPU time (min) | Memory usage (Gb) |
|---|---|---|---|---|---|---|
| SNPs (90,000), 3.6 Mb on disk | β-Lactam | 4,374 | 3 | 7 | 4.4 | 1.3 |
| Erythromycin | 2,341 | 3 | 63 | 4.1 | 1.3 | |
| Unitigs (730,000), 25 Mb on disk | β-Lactam | 8,247 | 5 | 7 | 49.7 | 18 |
| Erythromycin | 1,591 | 9 | 39 | 52.6 | 6.9 | |
| k-mers (10 million), 603 Mb on disk | β-Lactam | 15,121 | 6 | 7 | 420 | 212 |
Using a training/test split of 2:1, prediction accuracy of two phenotypes was tested using 90,000 SNP calls from mapping to a reference genome, and with 730,000 unitigs. We also tested prediction using 10 million variable-length k-mers to illustrate the heavy computational resource use in even a relatively small data set. File sizes are for the sparse data structures we employ.
Comparison of intra- and intercohort prediction accuracy
| Model | No. of selected unitigs (% in | Accuracy | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| SPARC data | Maela data | GPS data | ||||||||
| FNR | FPR | FNR | FPR | FNR | FPR | |||||
| Sequence reweighting | ||||||||||
| SPARC | 5,251 (10) | 0.063 | 0.024 | 0.837 | 0.007 | 0.239 | 0.439 | 0.149 | 0.134 | 0.505 |
| Maela | 6,645 (14) | 0.446 | 0.005 | 0.276 | 0.082 | 0.042 | 0.760 | 0.029 | 0.382 | 0.425 |
| GPS | 894 (4) | 0.011 | 0.411 | 0.447 | 0.144 | 0.177 | 0.458 | 0.094 | 0.200 | 0.545 |
| Without weighting | ||||||||||
| SPARC | 7,261 (10) | 0.040 | 0.013 | 0.901 | 0.012 | 0.163 | 0.487 | 0.165 | 0.130 | 0.487 |
| Maela | 8,705 (9) | 0.397 | 0.011 | 0.339 | 0.063 | 0.036 | 0.805 | 0.049 | 0.322 | 0.449 |
| GPS | 7,511 (2) | 0.050 | 0.152 | 0.656 | 0.319 | 0.026 | 0.452 | 0.129 | 0.037 | 0.864 |
For each prediction, the error rates are listed along with overall R2. For SPARC and Maela, phenotype was binary (resistant/sensitive); for GPS, phenotype was continuous (MIC). Where conversion was needed, we applied the standard breakpoint of MIC > 0.12 mg/liter for resistance.
Shaded cells are within-cohort. FNR, false-negative rate; FPR, false-positive rate.
FIG 1Prediction of carriage duration and the effect of sequence reweighting on heritability estimation. For the same training/test split, each panel shows observed log(carriage duration) values on the x axis and model-predicted values on the y axis, with a fitted linear regression. (Left) Unweighted model on the training data (top) and test (bottom). (Right) The same for the model with sequence reweighting.
Summary of data sets tested
| Data set name | Species | Phenotype(s) and split | Reference | No. of samples | No. of samples for training/test | No. of genetic features |
|---|---|---|---|---|---|---|
| TB | First-line antibiotic resistance: rifampicin, 1,285:2,257; isoniazid, 1,553:2,011; pyrazinamide, 702: 2,445; ethambutol, 975:2,551 | 3,566 | 2,377/1,189 | 6,400 (SNPs) | ||
| Antibiotic resistance MICs: azithromycin, cefixime, ciprofloxacin, penicillin, and tetracycline | 1,595 | NU | 550,000 (unitigs) | |||
| GAS | Virulence, 1,093:637 | 1,730 | 1,154/576 | 1.1 million (unitigs) | ||
| SPARC | Antibiotic resistance MICs: penicillin, erythromycin | 603 | 400/203 | 90,000 (SNPs), 730,000 (unitigs), 10 million (k-mers) | ||
| Maela | Carriage duration; antibiotic resistance: penicillin, 1,661:1,282; erythromycin, 802:2,355; trimethoprim, 609:2,548 | 3,162 (antibiotic resistance), 2,017 (carriage duration) | 1,404/703 (carriage duration) | 121,000 (SNPs), 1.6 million (unitigs) | ||
| GPS | Antibiotic resistance (penicillin) | 5,820 | NU | 1.7 million (unitigs) | ||
| Netherlands | Meningitis/carriage, 693:1,144 | 1,837 | 1,225/612 | 690,000 (unitigs) |
Each data set has a name by which it is referred to in the text. Most data sets have multiple phenotypes available, especially where multiple different antibiotic resistances are routinely phenotyped. Data sets without a training/test split were not evaluated for internal prediction ability as they were instead used with more stringent external validation data sets or were used for GWAS only, and all available samples were used to fit the model.
NU, not used.
FIG 2Power and false-positive rates in the simulation study set up to resemble antibiotic resistance genotype-phenotype architecture. (Top) The effect of sample size, with 100 causal variants in the pbp2x gene and a binary phenotype. (Bottom) The effect of phenotype heritability, with 50 causal variants spread across the three pbp genes and a continuous phenotype. Multivariate methods tested were the elastic net with default α (red) and Lasso regression (orange). Univariate methods were the fixed effects/seer model (blue) and the FastLMM linear mixed model (green).
FIG 3Elastic net and linear mixed model with SNP-based penicillin resistance. (Top) Manhattan plot of the selected elastic net variants, with a Bonferroni-adjusted significance threshold in red. The three biggest peaks are in the causal pbp genes. (Middle) The same result with the LMM, taking only those SNPs above the significance threshold. (Bottom) Summary of the genes selected by both methods (left, elastic net; right, LMM), maximum P value, and average absolute effect size within each gene.
FIG 4Heritability estimates for antibiotic resistance in the combined N. gonorrhoeae data set, using different methods. For each of the five antibiotic resistances measured in this data set, we report the narrow-sense heritability (h2) from our elastic net method and unitigs (gray), the limix method implemented in pyseer, using sequence distance (gold) or phylogeny (blue), and the restricted maximum likelihood (REML) approach used in the original publication (green). For limix estimates, 95% confidence intervals (CIs) were calculated with FIESTA (82). These are not shown for the phylogeny method as they span a range wider than the plot (0.11 to 1).