John A Lees, Minna Vehkala, Niko Välimäki, Simon R Harris, Claire Chewapreecha, Nicholas J Croucher, Pekka Marttinen, Mark R Davies, Andrew C Steer, Steven Y C Tong, Antti Honkela, Julian Parkhill, Stephen D Bentley, Jukka Corander.
Abstract
Bacterial genomes vary extensively in terms of both gene content and gene sequence. This plasticity hampers the use of traditional SNP-based methods for identifying all genetic associations with phenotypic variation. Here we introduce a computationally scalable and widely applicable statistical method (SEER) for the identification of sequence elements that are significantly enriched in a phenotype of interest. SEER is applicable to tens of thousands of genomes by counting variable-length k-mers using a distributed string-mining algorithm. Robust options are provided for association analysis that also correct for the clonal population structure of bacteria. Using large collections of genomes of the major human pathogens Streptococcus pneumoniae and Streptococcus pyogenes, SEER identifies relevant previously characterized resistance determinants for several antibiotics and discovers potential novel factors related to the invasiveness of S. pyogenes. We thus demonstrate that our method can answer important biologically and medically relevant questions.
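The core idea of testing k-mers for enrichment in a phenotype can be illustrated with a minimal sketch. This is not SEER's implementation: it assumes fixed-length k-mers rather than variable-length ones, swaps in Fisher's exact test as a simple stand-in for the association step, and applies no correction for population structure. The function names and the toy sequences are hypothetical.

```python
from collections import defaultdict
from math import comb


def kmers(seq, k):
    """Return the set of distinct k-mers of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the row1 cases.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1))
               if p <= observed * (1 + 1e-9))


def associate(genomes, phenotypes, k):
    """Map each k-mer to a p-value for association with a binary phenotype."""
    presence = defaultdict(set)
    for sample, seq in genomes.items():
        for kmer in kmers(seq, k):
            presence[kmer].add(sample)
    cases = {s for s, y in phenotypes.items() if y}
    controls = set(phenotypes) - cases
    results = {}
    for kmer, carriers in presence.items():
        a = len(carriers & cases)      # cases carrying the k-mer
        b = len(cases) - a             # cases lacking it
        c = len(carriers & controls)   # controls carrying the k-mer
        d = len(controls) - c          # controls lacking it
        results[kmer] = fisher_two_sided(a, b, c, d)
    return results


# Toy data (hypothetical): three "resistant" and three "susceptible" samples.
genomes = {
    "r1": "AAGATTACAGG", "r2": "CCGATTACATT", "r3": "TTGATTACACC",
    "s1": "AAGCTTGCAGG", "s2": "CCGCTTGCATT", "s3": "TTGCTTGCACC",
}
phenotypes = {"r1": 1, "r2": 1, "r3": 1, "s1": 0, "s2": 0, "s3": 0}
pvalues = associate(genomes, phenotypes, k=5)
for kmer, p in sorted(pvalues.items(), key=lambda item: item[1])[:5]:
    print(kmer, round(p, 3))
```

With only six samples no k-mer can reach a small p-value, which is one reason large collections of genomes matter for this kind of analysis.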
Year: 2016; PMID: 27633831; PMCID: PMC5028413; DOI: 10.1038/ncomms12797
Source DB: PubMed; Journal: Nat Commun; ISSN: 2041-1723; Impact factor: 14.919
Figure 1: Power to find associations versus number of samples.
Power was estimated using simulations and subsamples of the population, as described in the Methods, for (a) detecting gene presence/absence at different odds ratios; (b) using all informative k-mers versus a single k-mer length; and (c) detecting k-mers near, in the correct gene, or containing the causal variant for trimethoprim resistance. All curves are logistic fits to the mean power over 100 subsamples.
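To make the notion of power in panel (a) concrete, the following is a rough sketch of estimating power by repeated sampling. It is an illustration under stated assumptions, not the procedure from the paper's methods: it draws synthetic samples instead of subsampling real genomes, uses a plain chi-squared test at a 0.05 threshold, and the parameters `control_freq` and `case_fraction` are hypothetical choices.

```python
import random

CHI2_CRIT_1DF = 3.841  # chi-squared critical value, 1 degree of freedom, alpha = 0.05


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def simulate_power(n_samples, odds_ratio, control_freq=0.2,
                   case_fraction=0.5, n_reps=100, seed=0):
    """Fraction of simulated data sets in which the association is detected."""
    rng = random.Random(seed)
    # Convert the odds ratio into a gene frequency among cases.
    odds_case = odds_ratio * control_freq / (1 - control_freq)
    case_freq = odds_case / (1 + odds_case)
    hits = 0
    for _ in range(n_reps):
        a = b = c = d = 0
        for _ in range(n_samples):
            is_case = rng.random() < case_fraction
            has_gene = rng.random() < (case_freq if is_case else control_freq)
            if is_case and has_gene:
                a += 1
            elif is_case:
                b += 1
            elif has_gene:
                c += 1
            else:
                d += 1
        if chi2_2x2(a, b, c, d) > CHI2_CRIT_1DF:
            hits += 1
    return hits / n_reps


# Power rises with sample size for a fixed (assumed) odds ratio of 3.
for n in (50, 100, 200, 400):
    print(n, simulate_power(n, odds_ratio=3.0))
```

Plotting these estimates against sample size and fitting a logistic curve would give a figure of the same general shape as the one described in the caption.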
K-mers associated with antibiotic resistance.
| Chloramphenicol | 204 (7%) | 1,526 | 1,526 | 1,508—ICE | 166— |
| | | | | 288—ORF (UniParc B8ZK82) | |
| | | | | 206— | |
| Erythromycin | 803 (26%) | 1,154 | 112 | 10—permease (UniParc B8ZKV5) | 4—mega element |
| | | | | 8— | 2— |
| | | | | 6— | 2—omega element |
| | | | | 4—ICE | |
| β-lactams | 1,563 (51%) | 23,876 | 17,453 | 381—ICE | 47— |
| | | | | 145—prophage MM1 | 20— |
| | | | | 50—SPN23F15110 (UniParc B8ZLE7) | 8— |
| | | | | 49—ICE | |
| Tetracycline | 1,958 (64%) | 962 | 962 | 962—ICE | 96— |
| | | | | 136—ICE | |
| | | | | 121—ICE | |
| Trimethoprim | 2,553 (83%) | 2,639 | 210 | 21— | |
ICE, integrative conjugative element
Results from SEER for binary antibiotic resistance outcomes in a population of 3,069 S. pneumoniae. Significant k-mers are first interpreted by mapping to the ATCC 700669 reference genome. Up to four of the most highly covered annotations are shown; where the known mechanism is among them, it is highlighted in bold. The ICE is the top hit in three analyses, as it carries multiple drug-resistance elements and is commonly found in multi-drug resistant strains (ref. 16). The distribution of phenotype across the phylogeny is shown in Supplementary Fig. 4.
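As a quick consistency check on the table, the percentages in the second column can be reproduced from the counts, assuming that column reports the number of resistant samples out of the 3,069 in the population. That reading is inferred from the caption, since the column headers are not available here.

```python
TOTAL_SAMPLES = 3069  # population size stated in the caption

# Counts copied from the second column of the table; treating them as
# resistant-sample counts is an assumption.
resistant = {
    "Chloramphenicol": 204,
    "Erythromycin": 803,
    "β-lactams": 1563,
    "Tetracycline": 1958,
    "Trimethoprim": 2553,
}

# Percentage of the population, rounded to the nearest whole number.
percent = {drug: round(100 * n / TOTAL_SAMPLES) for drug, n in resistant.items()}
for drug, pct in percent.items():
    print(f"{drug}: {resistant[drug]} / {TOTAL_SAMPLES} = {pct}%")
```

The computed values agree with the percentages printed in the table (7, 26, 51, 64 and 83), which supports this reading of the column.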
Figure 2: Fine mapping trimethoprim resistance.
The locus pictured contains 72 significant k-mers, the most of any gene cluster. Coverage over the locus is pictured at the bottom of the figure. Shown above the genes are high-quality missense SNPs, plotted using their P value for affecting protein function as predicted by SIFT. Scale bar is 200 base pairs.