Santeri Puranen1,2, Maiju Pesonen2,1, Johan Pensar1, Ying Ying Xu2,1, John A Lees3, Stephen D Bentley3, Nicholas J Croucher4, Jukka Corander5,1,3.
Abstract
The potential for genome-wide modelling of epistasis has recently surfaced given the possibility of sequencing densely sampled populations and the emerging families of statistical interaction models. Direct coupling analysis (DCA) has previously been shown to yield valuable predictions for single protein structures, and has recently been extended to genome-wide analysis of bacteria, identifying novel interactions in the co-evolution between resistance, virulence and core genome elements. However, earlier computational DCA methods have not been scalable to enable model fitting simultaneously to 10⁴-10⁵ polymorphisms, representing the amount of core genomic variation observed in analyses of many bacterial species. Here, we introduce a novel inference method (SuperDCA) that employs a new scoring principle, efficient parallelization, optimization and filtering on phylogenetic information to achieve scalability for up to 10⁵ polymorphisms. Using two large population samples of Streptococcus pneumoniae, we demonstrate the ability of SuperDCA to make additional significant biological findings about this major human pathogen. We also show that our method can uncover signals of selection that are not detectable by genome-wide association analysis, even though our analysis does not require phenotypic measurements. SuperDCA, thus, holds considerable potential in building understanding about numerous organisms at a systems biological level.
Keywords: epistasis; linkage disequilibrium; population genomics
Year: 2018 PMID: 29813016 PMCID: PMC6096938 DOI: 10.1099/mgen.0.000184
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Log histograms of the cumulative distributions of estimated between-site couplings for the Maela (left) and Massachusetts (right) populations. The thresholds indicate the learned boundary between negligible and moderate to strong couplings.
Fig. 2. Maela population distribution of alleles at top ranked coupled SNP sites. The estimated genome-wide maximum-likelihood phylogeny is shown on the left. Each column is labelled by the genome position, gene name and a corresponding functional categorization. Columns marked by red rectangles indicate coupled sites in pbp2x and pbp2b that have a reversed minor/major allele distribution compared with the remaining displayed SNPs in the same genes.
Fig. 3. Structural mapping of the Pbp2x (a–c) and Pbp2b (d) positions marked in Fig. 2. The panels show the transpeptidase domains of each PBP with active site residues shown in cyan and positions marked in Fig. 2 as sticks in orange or green. (a) depicts a structure-stabilizing cluster of conserved hydrophobic residues (light grey sticks) and charge interaction (dark grey) in a region proximal to (cyan cartoon) the Pbp2x active site (with bound inhibitory antibiotic as pink space-filling volume) and a mobile loop (red cartoon) covering the active site. (b) depicts the PASTA-2 domain essential for divisome complex function (green cartoon) with the bulk of the protein to the right (grey cartoon). (c) shows an overview of the Pbp2x transpeptidase domain coloured as in the detail views in (a) and (b). (d) depicts the Pbp2b transpeptidase domain region proximal to the active site with a helix (orange cartoon) mechanically connecting the active site to the 'top' of the protein. An adjacent mobile loop covering the active site is shown in red.
Fig. 4. Overlap of estimated SNP interactions between the Maela and Massachusetts populations. Each dot represents an estimated link (interaction) between two coding sequences (CDSs). Blue dots mark CDSs involved in antibiotic resistance, and red dots mark CDSs in close proximity to antibiotic resistance loci. Grey dots represent other functional categories not displayed here explicitly for visual clarity. Both axes are on a log scale and the values represent numbers of links in each CDS pair.
Fig. 5. Seasonal variation of the allele frequencies for the two top cold-resistance couplings between glpF1-rnr and glpF1-lytC averaged over 3 years, 2007–2010. The shaded areas indicate 95% confidence intervals.
Fig. 6. Estimated mutual information (MI) for 60 749 pairs of SNPs (Maela) and 125 469 pairs of SNPs (Massachusetts).
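
For readers unfamiliar with the quantity plotted in Fig. 6, the following is a minimal sketch of how MI between two SNP sites can be estimated from aligned sequences. It uses the plain plug-in (empirical frequency) estimator in nats; the estimator actually used in the paper may differ, for example in its treatment of pseudocounts, gaps or sample weighting. The function name `mutual_information` and the toy allele columns are illustrative assumptions, not taken from the source.

```python
from collections import Counter
from math import log


def mutual_information(site_a, site_b):
    """Plug-in estimate of mutual information (in nats) between two SNP sites.

    site_a and site_b are equally long sequences holding the allele observed
    at each site in every sampled genome (hypothetical input layout).
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must cover the same set of samples")
    n = len(site_a)
    if n == 0:
        raise ValueError("at least one sample is required")

    count_a = Counter(site_a)                # marginal allele counts, site A
    count_b = Counter(site_b)                # marginal allele counts, site B
    count_ab = Counter(zip(site_a, site_b))  # joint allele-pair counts

    mi = 0.0
    for (a, b), c in count_ab.items():
        # p_ab * log(p_ab / (p_a * p_b)), written with raw counts
        mi += (c / n) * log(c * n / (count_a[a] * count_b[b]))
    return mi


# Toy columns: two perfectly linked sites and two independent sites.
linked = mutual_information("AACC", "GGTT")
independent = mutual_information("AACC", "GTGT")
```

In this toy case the perfectly linked pair gives an MI of log 2 and the independent pair gives 0, which illustrates why high MI flags strongly co-varying sites.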