EDEN: evolutionary dynamics within environments.

Philipp C Münch^1,2,3, Bärbel Stecher^2,4, Alice C McHardy^1,3,5,6.

Abstract

SUMMARY: Metagenomics revolutionized the field of microbial ecology, giving access to Gb-sized datasets of microbial communities under natural conditions. This enables fine-grained analyses of the functions of community members, studies of their association with phenotypes and environments, as well as of their microevolution and adaptation to changing environmental conditions. However, phylogenetic methods for studying adaptation and evolutionary dynamics are not able to cope with big data. EDEN is the first software for the rapid detection of protein families and regions under positive selection, as well as their associated biological processes, from meta- and pangenome data. It provides an interactive result visualization for detailed comparative analyses.
AVAILABILITY AND IMPLEMENTATION: EDEN is available as a Docker installation under the GPL 3.0 license, allowing its use on common operating systems, at http://www.github.com/hzi-bifo/eden. CONTACT: alice.mchardy@helmholtz-hzi.de. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017 PMID: 28637301 PMCID: PMC5860032 DOI: 10.1093/bioinformatics/btx394

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Microorganisms can adapt to changing environmental conditions by evolutionary processes such as mutation, lateral gene transfer and recombination (Bendall ; Denef and Banfield, 2012; Koonin ). Only a small fraction of mutations are thought to be beneficial, and few will be fixed and contribute to the substitution rate measurable in phylogenomic studies (Bendall ; Nishant ). One of the most widely used measures for quantifying the type and extent of selection acting on a protein family site or sequence at the molecular scale is the dN/dS ratio (Ford, 2002; Nielsen, 2005). If the rate of change at nonsynonymous sites (dN) within a gene family exceeds the rate of change at synonymous sites (dS), i.e. dN/dS > 1, positive selection is assumed to operate on the encoded protein. This indicates that adaptation to altered environmental conditions is taking place and that the observed changes increase the fitness of the respective organism. A dN/dS < 1, on the other hand, is taken as an indicator of negative selection, with changes in the protein sequence or at a site decreasing fitness (Hurst, 2002; Koonin and Rogozin, 2003), for instance changes in catalytic sites that would lead to a loss of function. Calculating the dN/dS ratio for the large-scale sequence datasets that are being generated in metagenomics and comparative microbial genomics is very challenging, due to the run times of commonly used tree inference software and software for quantifying positive selection, such as FastCodeML (Valle ), which relies on maximum likelihood methods (Pond and Frost, 2005). With EDEN, we provide a fully automated software package and visualization framework for a rapid meta- or pangenome-wide analysis of the evolutionary processes affecting protein families and associated biological processes. EDEN is based on a fast approximate tree inference and count-based dN/dS inference for individual protein families. 
The software can be applied to compare the selection profiles of bacterial species with different phenotypes or lifestyles, such as pathogens versus mutualists, and to study selection from metagenome datasets of microbial communities (Fig. 1b).
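
To make the dN/dS definition above concrete, the following minimal Python sketch counts synonymous and nonsynonymous sites and differences for two aligned coding sequences, in the spirit of the counting approach of Nei and Gojobori (1986). It is an illustration only and not EDEN's implementation: the function names are hypothetical, the sequences are assumed to be uppercase, gap-free and in frame, codons differing at more than one position are skipped, and no multiple-hit correction is applied.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Number of synonymous sites in one codon (a value between 0 and 3)."""
    sites = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Each of the three possible changes at a position contributes 1/3.
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                sites += 1 / 3
    return sites


def count_changes(seq_a, seq_b):
    """Return (nonsyn_diffs, syn_diffs, nonsyn_sites, syn_sites)."""
    nonsyn_diffs = syn_diffs = 0
    syn_sites = 0.0
    n_codons = len(seq_a) // 3
    for i in range(0, 3 * n_codons, 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        # Site counts are averaged over the two sequences.
        syn_sites += (synonymous_sites(codon_a) + synonymous_sites(codon_b)) / 2
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff == 1:  # simplification (assumption): multi-hit codons are skipped
            if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1
    nonsyn_sites = 3 * n_codons - syn_sites
    return nonsyn_diffs, syn_diffs, nonsyn_sites, syn_sites


def pn_ps_ratio(seq_a, seq_b):
    """Uncorrected ratio of nonsynonymous to synonymous changes per site."""
    nonsyn_diffs, syn_diffs, nonsyn_sites, syn_sites = count_changes(seq_a, seq_b)
    if syn_diffs == 0:
        return None  # ratio undefined without synonymous changes
    return (nonsyn_diffs / nonsyn_sites) / (syn_diffs / syn_sites)


if __name__ == "__main__":
    a = "TTTGGGATGGCTGCTGCT"
    b = "TTCGGGATAGCTGCTGCT"
    print(count_changes(a, b))
    print(pn_ps_ratio(a, b))
```

In this toy pair, one synonymous change (TTT to TTC) and one nonsynonymous change (ATG to ATA) occur, but there are far more nonsynonymous than synonymous sites, so the per-site ratio falls below 1.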

Fig. 1

(a) Sample and data processing workflow for dN/dS profiling using EDEN. To enable an interactive analysis, an RStudio Shiny server is started inside the Docker image, which the user can access with a web browser via localhost. Dashed boxes are optional input files. (b) Interactive visualization enables comparison of pooled samples. Here, selected HMP samples pooled by body site are shown. (c) Example output of clusters of residues under positive selection for one gene family. Dots indicate dN/dS for a given position in the protein sequence, and their color corresponds to the proportion of gaps in the MSA. Red areas indicate significant clusters of residues under positive selection. Abbreviations: UI, User Interface

2 Implementation

EDEN is provided as a Docker image, which is a virtualization of the application that includes everything needed to run the program (Merkel, 2014) (Fig. 1a). Other than Docker, no further software needs to be installed. EDEN requires as input (meta-)genome DNA sequences in FASTA format, from which open reading frame (ORF) DNA and protein sequences will be generated with Prodigal (Hyatt ) (Fig. 1a). Alternatively, the user can provide two files in FASTA format, corresponding to the DNA and protein sequences for a set of ORFs. Optionally, one or multiple Hidden Markov Models (HMMs) can also be provided, which will be used to infer groups of ORFs, as well as a sample grouping table with sample properties of interest, such as their origin, to enable comparative analyses of multiple input samples in groups representing these properties. The first step in the assessment of evolutionary patterns for the input sequences is their division into groups of ORFs that are subsequently processed together. Groups can be obtained (i) based on the user-provided grouping table, (ii) by searching for protein family members with hmmsearch against user-provided HMMs (Eddy, 1998) or (iii) by searching the ORFs with hmmsearch against the complete TIGRFAM HMM collection (Haft ). For each group, a multiple DNA sequence alignment (MSA) is then calculated with MUSCLE (Edgar, 2004). Subsequently, a multiple codon alignment is constructed using PAL2NAL 14 under consideration of the MSA and the protein sequences (Suyama ). Based on the codon MSA, a phylogenetic tree is reconstructed with an efficient implementation of the neighbor-joining algorithm using a modified version of Clearcut (Sheneman ). Specifically, we control for gaps in the alignment that are mostly of technical origin (due to the alignment of smaller assembled contigs to longer reference sequences) by excluding these from mismatch counts in the calculation of the additive pairwise distance matrix. 
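
The gap handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the modified Clearcut code: the function names are hypothetical, the gap character is assumed to be "-", and normalizing the mismatch count by the number of compared (gap-free) columns is our choice for the sketch.

```python
GAP = "-"  # assumed gap symbol in the alignment


def gap_aware_distance(seq_a, seq_b):
    """Proportion of mismatches over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = mismatches = 0
    for x, y in zip(seq_a, seq_b):
        if x == GAP or y == GAP:
            continue  # gap columns do not contribute to the mismatch count
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 0.0


def distance_matrix(aligned_seqs):
    """Symmetric pairwise distance matrix for a list of aligned sequences."""
    n = len(aligned_seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = gap_aware_distance(aligned_seqs[i], aligned_seqs[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


if __name__ == "__main__":
    # The short first sequence mimics a contig aligned to longer references.
    alignment = ["ATG---CCA", "ATGAAACTA", "ATGAAACCA"]
    for row in distance_matrix(alignment):
        print(row)
```

Without this exclusion, the short contig in the example would appear distant from both longer sequences solely because of its missing segment. A matrix of this kind could then be passed to a neighbor-joining implementation.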
Next, dN/dS is calculated using the counting method (Pond and Frost, 2005), which achieves a trade-off between the computational effort and the quality of the estimates. For the dN/dS calculation, the ancestral amino acid and coding sequences are reconstructed for all internal nodes of each protein family tree, using maximum parsimony as the optimization criterion. The values of dN and dS are then inferred from these sequences, considering the least costly of several different mutation paths between codons, as in Tusche . The dN/dS ratio is then calculated using a lookup table with the probabilities that a change will cause a nonsynonymous change for all possible codon comparisons (Nei and Gojobori, 1986). For calculation of the average dN/dS for a considered group, low-confidence positions are excluded by removing alignment positions whose proportion of gaps exceeds a user-defined threshold. P values are calculated using a one-sided Fisher’s exact test, based on the dN and dS rates for every sequence group in comparison to the entire sample. The false discovery rate is used to control for multiple testing errors, using the Benjamini and Hochberg procedure (Benjamini and Hochberg, 1995), with α set to 0.05 by default. To detect putative epitopes within a given set of homologs that are under positive selection, the P value for the sum of the dN and dS rates is calculated using a sliding window approach (with a default size of 20 codons) and a one-sided Fisher’s exact test over the MSA (Bulgarelli ; McCann ) (see Supplementary Material for details). As a ‘sanity check’, we compared the dN/dS of EDEN with that of HyPhy SLAC, which uses a derivative of the Suzuki-Gojobori counting approach, for 50 randomly selected protein families from the HMP dataset and found a high correlation (Pearson’s R = 0.873, P value = 2.499e-16, Supplementary Fig. S2). 
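
The statistical testing step can be illustrated with a short, self-contained Python sketch of a one-sided Fisher's exact test followed by Benjamini-Hochberg adjustment. It is a sketch under stated assumptions and not EDEN's code: the function names are hypothetical, and we assume the test is applied to a 2x2 table of nonsynonymous and synonymous change counts for one protein family versus the remaining background. The counts in the example are made up for illustration.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns the probability of observing a count of at least `a` in the
    top-left cell, given fixed row and column totals (hypergeometric tail).
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    numerator = sum(
        comb(col1, k) * comb(total - col1, row1 - k) for k in range(a, upper + 1)
    )
    return numerator / comb(total, row1)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical (nonsynonymous, synonymous) counts per family.
    families = {"famA": (30, 10), "famB": (12, 40), "famC": (25, 22)}
    total_n = sum(n for n, _ in families.values())
    total_s = sum(s for _, s in families.values())
    names = list(families)
    p_values = [
        fisher_exact_greater(n, s, total_n - n, total_s - s)
        for n, s in (families[name] for name in names)
    ]
    for name, q in zip(names, benjamini_hochberg(p_values)):
        print(name, "significant" if q < 0.05 else "not significant", q)
```

The same test function could be reused for a sliding window scan by treating the codons inside a window as one group and the rest of the alignment as the background, although the exact window statistic used by EDEN is described only in the Supplementary Material.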
In comparison to FastCodeML (Valle ), a run-time optimized version of the codeml program from the PAML package (Yang, 2007), EDEN has a drastically reduced run-time (Supplementary Fig. S2, see Supplementary Material for details). For further detailed analyses with FastCodeML, such as assessing selection for specific clades, the codon alignment and tree calculated by EDEN can be downloaded for individual protein families.

3 Application

We previously used EDEN to study protein families under selection from six assembled metagenome samples (150,000 ORF sequences) of the root microbiota for wild and domesticated barley (Hordeum vulgare). This delivered evidence for positive selection acting on protein families linked to pathogenesis, bacteria-phage interactions, secretion and nutrient mobilization in the barley root-associated microbiota (Bulgarelli ) and for a higher degree of selection acting on protein families from the root-associated microbiota than on those found in bulk soil. EDEN was also used to compare the selection patterns for protein families from multiple strains of Colletotrichum (Hacquard ). We applied EDEN to 66 samples of the HMP project (Consortium ) from six body sites (body sites dataset on http://eden.bifo.helmholtz-hzi.de, Fig. 1b). These were sampled from healthy individuals and had similar alpha- and beta-diversities, except for the stool samples, which were more diverse (Consortium ). The results indicate a positive relationship between dN/dS values and the exposure of body sites to the surrounding environment. The highest values were found for samples from the external portion of the nose (anterior nares), followed by the oral microbiome (subgingival plaque, palatine tonsils, throat), and the lowest values for stool. Also, when comparing the samples of one body site to all others, a significantly increased dN/dS (FDR-corrected P values < 0.001) was observed (in that order) for the microbiome from the external portion of the nose, subgingival plaque and throat. Interestingly, across all six body sites, most protein families with significant signs of positive selection in comparison to all other protein families from the respective samples (FDR-adjusted P value < 0.01) were annotated with transport and binding functions, suggesting the existence of a functional pan-selectome. 
Other than that, ‘energy metabolism’ was found as a prominent association for the oral and nose-associated microbiomes. We also used EDEN to characterize human gut metagenome samples from (Qin ) (BMI dataset on http://eden.bifo.helmholtz-hzi.de). EDEN determined a significantly higher dN/dS for the protein-coding genes from lean individuals (BMI < 25) compared to overweight (BMI 25-30) and obese individuals (BMI > 30; P value < 0.001), suggestive of a higher functional diversity in the guts of lean individuals. For lean individuals compared to obese individuals, this finding was consistent over all functional groups, except for regulatory functions. For these, dN/dS was slightly, though not significantly, higher for the obese than for the lean individuals.

4 Conclusion

EDEN can identify protein families under positive selection from metagenome and pangenome datasets. It reports gene families and regions thereof with a significantly elevated dN/dS in comparison to a specified background and allows comparative studies of multiple samples. The results obtained for metagenome samples from different demonstrate how these analyses provide insights into the relationship between the signs of molecular adaptation found in the microbiome and the associated biological processes and environments.

Author contributions

A.C.M. and P.C.M. conceived and designed the experiments. P.C.M. implemented the software. P.C.M. wrote the manuscript with comments from A.C.M. and B.S. All authors approved the final version of the manuscript.

24 in total

Review 1. Horizontal gene transfer in prokaryotes: quantification and classification.

Authors: E V Koonin; K S Makarova; L Aravind
Journal: Annu Rev Microbiol Date: 2001 Impact factor: 15.500

Review 2. The Ka/Ks ratio: diagnosing the form of sequence evolution.

Authors: Laurence D Hurst
Journal: Trends Genet Date: 2002-09 Impact factor: 11.639

3. In situ evolutionary rate measurements show ecological success of recently emerged bacterial hybrids.

Authors: Vincent J Denef; Jillian F Banfield
Journal: Science Date: 2012-04-27 Impact factor: 47.728

Review 4. Molecular signatures of natural selection.

Authors: Rasmus Nielsen
Journal: Annu Rev Genet Date: 2005 Impact factor: 16.830

5. Clearcut: a fast implementation of relaxed neighbor joining.

Authors: Luke Sheneman; Jason Evans; James A Foster
Journal: Bioinformatics Date: 2006-09-18 Impact factor: 6.937

Review 6. Profile hidden Markov models.

Authors: S R Eddy
Journal: Bioinformatics Date: 1998 Impact factor: 6.937