
M1CR0B1AL1Z3R-a user-friendly web server for the analysis of large-scale microbial genomics data.

Oren Avram¹, Dana Rapoport¹, Shir Portugez¹, Tal Pupko¹.

Abstract

Large-scale mining and analysis of bacterial datasets contribute to the comprehensive characterization of complex microbial dynamics within a microbiome and among different bacterial strains, e.g., during disease outbreaks. The study of large-scale bacterial evolutionary dynamics poses many challenges. These include data-mining steps, such as gene annotation, ortholog detection, sequence alignment and phylogeny reconstruction. These steps require the use of multiple bioinformatics tools and ad-hoc programming scripts, making the entire process cumbersome, tedious and error-prone due to manual handling. This motivated us to develop the M1CR0B1AL1Z3R web server, a 'one-stop shop' for conducting microbial genomics data analyses via a simple graphical user interface. Some of the features implemented in M1CR0B1AL1Z3R are: (i) extracting putative open reading frames and comparative genomics analysis of gene content; (ii) extracting orthologous sets and analyzing their size distribution; (iii) analyzing gene presence-absence patterns; (iv) reconstructing a phylogenetic tree based on the extracted orthologous set; (v) inferring GC-content variation among lineages. M1CR0B1AL1Z3R facilitates the mining and analysis of dozens of bacterial genomes using advanced techniques, with the click of a button. M1CR0B1AL1Z3R is freely available at https://microbializer.tau.ac.il/.


Year: 2019 PMID: 31114912 PMCID: PMC6602433 DOI: 10.1093/nar/gkz423

Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971

INTRODUCTION

In a typical microbial genomics study, a few dozen bacterial samples are sequenced using next generation sequencing technologies, with each sample representing a different bacterial species, strain or isolate. The obtained reads are assembled, generating a set of contigs for each sample. This set of partially assembled genomes is then analyzed using bioinformatics tools to gain insights into the bacterial evolutionary dynamics and genomic composition of these samples. Typical research challenges are: (i) inferring the core genome and pangenome (the set of genes shared by all members of the analyzed clade and the set of genes present in at least one member of the analyzed clade, respectively) (1); (ii) reconstructing the evolutionary history of the analyzed samples as a phylogenetic tree (2); (iii) analyzing the variation in GC content among samples (3); (iv) analyzing the gene gain and loss dynamics, which is often an indication of the intensity of horizontal gene transfer (4); (v) detecting genes that are likely to have experienced positive selection (5–7).

The above computations require the use of multiple bioinformatics tools and ad-hoc programming scripts to handle information flow among the various programs, which in turn necessitates a dedicated bioinformatician to conduct such analyses. As a result, research laboratories began implementing their own in-house analysis pipelines, and later, different analysis applications began to emerge (8–10). These applications require specific working environments (i.e., operating systems), computation power (multicore machines), and more than basic technological skills (e.g., for installing and running them).

Previously developed web tools for analyzing sequenced microbial genomes include the MG-RAST, Pan-X and PGAweb web servers. MG-RAST allows finding and annotating gene functions or pathways by comparing genes to other databases (11). M1CR0B1AL1Z3R differs from it in that it focuses on comparing genomes rather than on annotating them, and it does not rely on external databases. The Pan-X web server provides ready-made examples of different microbial datasets (8). However, it does not allow users to provide unpublished genomic sequences as input. PGAweb provides several outputs, such as an analysis of the orthologous groups and reconstruction of the phylogenetic relationships among the sequences (12). However, in contrast to the M1CR0B1AL1Z3R web server, described below, it can only handle up to 50 genomic samples. In addition, it reconstructs phylogenetic relationships using neighbor joining or UPGMA, which are known to be less accurate than state-of-the-art methodologies for tree reconstruction such as maximum-likelihood and Bayesian approaches (13).

Here we present the M1CR0B1AL1Z3R (pronounced: microbializer) web server. M1CR0B1AL1Z3R was developed to facilitate microbial analyses and make them more accessible to the scientific community. M1CR0B1AL1Z3R utilizes a versatile computational pipeline that runs on the cloud and provides quick and easy analyses of bacterial genomics data for all users (Figure 1). No installation and no other prerequisites are needed. Visual and textual results that are ready for publication or further analysis are given as output.

Figure 1.

M1CR0B1AL1Z3R web server workflow. MSA, multiple sequence alignment. OG, orthologous group.

MATERIALS AND METHODS

Input

The M1CR0B1AL1Z3R web server requires assembled genomic sequences (fully assembled or as contigs) from several clades. Each clade can represent a bacterial (or archaeal) isolate, strain or species. Each clade should be in a separate Fasta format file (such files are generated using assembly programs such as Velvet (14) or Canu (15)). Notably, in many metagenomic studies, the assignment of the various contigs to separate clades is unknown, and in this case, the data should be binned prior to running M1CR0B1AL1Z3R (16). To upload the files to M1CR0B1AL1Z3R, we ask the user to put them in a zipped folder (zip or tar.gz). Upon completion of the analyses, a link to the results is sent to the user if they choose to provide their email address. The results remain available on the web server for at least 3 months.

Extracting putative open reading frames (ORFs)

We extract ORFs from each genome using Prodigal (17) in ‘normal’ mode. Prodigal uses an unsupervised machine learning approach to extract protein-coding ORFs.
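As a rough illustration of this step, the sketch below builds a Prodigal command line for a single genome. It assumes that 'normal' mode corresponds to Prodigal's default single-genome procedure (`-p single`); the file names and the helper functions `prodigal_command` and `extract_orfs` are hypothetical and are not taken from the M1CR0B1AL1Z3R source code.

```python
import subprocess

def prodigal_command(genome_fasta, orfs_fasta, coords_file):
    """Build a Prodigal call for one genome (all file names are hypothetical)."""
    return [
        "prodigal",
        "-i", genome_fasta,  # input: assembled genome or contigs in FASTA format
        "-d", orfs_fasta,    # output: nucleotide sequences of the predicted genes
        "-o", coords_file,   # output: coordinates of the predicted genes
        "-p", "single",      # single-genome procedure (assumed to be 'normal' mode)
        "-q",                # suppress progress messages on stderr
    ]

def extract_orfs(genome_fasta, orfs_fasta, coords_file):
    """Run Prodigal; requires the `prodigal` executable to be on PATH."""
    subprocess.run(prodigal_command(genome_fasta, orfs_fasta, coords_file), check=True)

# Build (but do not run) the command for one hypothetical genome file.
cmd = prodigal_command("strain_A.fasta", "strain_A.orfs.fna", "strain_A.coords")
print(" ".join(cmd))
```

In a pipeline of this kind, the command would be issued once per input genome, and the resulting ORF files would feed the homology search.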

Extracting orthologous sets

A homology search is conducted in which each ORF is queried against all other ORFs in the dataset (all-against-all). Homology searches are executed using the equivalent of tBlastX in the MMseqs2 program, which is ∼400 times faster than BLAST with similar accuracy (18). For each ORF, we record the top hit in each other genome. If ORF x in genome i is the top hit for ORF y in genome j and vice versa, these two ORFs are considered putative orthologs (best reciprocal hit, as in (19)). This pairwise analysis induces a graph in which each node is an ORF, and two nodes are connected if they are best reciprocal hits. An orthologous group is a set of nodes that are highly connected to each other and are separated from the rest of the nodes. We use the Markov Cluster (MCL) algorithm (as done in the OrthoMCL pipeline (20)) with default parameters (inflation parameter = 2.0) to detect these high-confidence orthologous groups.
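The best-reciprocal-hit criterion described above can be sketched as follows. This is a minimal illustration, not the server's implementation: it assumes the homology search has already been reduced to (query ORF, target ORF, score) tuples, and the names `reciprocal_best_hits`, `genome_of` and `hits` are hypothetical. The subsequent MCL clustering step is not shown.

```python
def reciprocal_best_hits(hits, genome_of):
    """Return the edges of the ortholog graph as a set of ORF pairs.

    hits      -- iterable of (query_orf, target_orf, score); higher score is better
    genome_of -- dict mapping each ORF identifier to its genome
    """
    # For every ORF, keep only its top-scoring hit in each *other* genome.
    best = {}  # (query ORF, target genome) -> (target ORF, score)
    for query, target, score in hits:
        target_genome = genome_of[target]
        if genome_of[query] == target_genome:
            continue  # hits within the same genome are ignored
        key = (query, target_genome)
        if key not in best or score > best[key][1]:
            best[key] = (target, score)

    # Two ORFs are connected if each is the other's top hit.
    edges = set()
    for (query, _), (target, _) in best.items():
        back = best.get((target, genome_of[query]))
        if back is not None and back[0] == query:
            edges.add(frozenset((query, target)))
    return edges

# Toy example with two genomes, A and B.
genome_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
hits = [
    ("a1", "b1", 200), ("a1", "b2", 50),
    ("b1", "a1", 190),
    ("b2", "a1", 60), ("b2", "a2", 40),
    ("a2", "b2", 120),
]
edges = reciprocal_best_hits(hits, genome_of)
print(edges)
```

In this toy input, a1 and b1 are each other's top hit and are therefore connected. The top hit of a2 is b2, but the top hit of b2 in genome A is a1, so a2 and b2 are not connected.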

Multiple sequence alignments (MSAs) and phylogenetic tree reconstruction

For each orthologous group, all sequences are first translated and the resulting protein sequences are then aligned using MAFFT, with the ‘--auto’ flag, which automatically selects an appropriate MAFFT algorithm (L-INS-i, FFT-NS-i or FFT-NS-2) according to the size of the analyzed dataset (21). Sequences are then reverse-translated so that codon-level alignments can also be computed (as in (22)). A maximum-likelihood phylogenetic tree is reconstructed based on the concatenated protein MSA of all core genes, i.e., genes shared among all strains (see below), using RAxML (23) with default parameters, the LG replacement matrix (24), and a discrete gamma distribution with four categories and an invariant category (LG+G+I) to account for among-site rate variation. Of note, we have recently shown that when searching for the maximum-likelihood tree topology, using LG+G+I provides results that are as accurate as when a model selection step is introduced, and the latter is therefore not mandatory (25). The tree is visualized using PhyD3 (26).
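The reverse-translation step can be illustrated by threading the codons of each coding sequence onto the corresponding row of the protein alignment. The sketch below shows the general idea only; the function name `thread_codons` and the handling of a trailing stop codon are assumptions and do not describe the server's actual code.

```python
def thread_codons(aligned_protein, cds):
    """Map an ungapped coding sequence onto one gapped row of a protein MSA.

    Every residue is replaced by its codon and every gap '-' by '---'.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for char in aligned_protein if char != "-")
    # Assumption: a trailing stop codon has no residue in the protein row.
    if len(codons) == n_residues + 1:
        codons = codons[:-1]
    if len(codons) != n_residues:
        raise ValueError("protein and coding sequence lengths do not match")
    codon_iter = iter(codons)
    return "".join("---" if char == "-" else next(codon_iter)
                   for char in aligned_protein)

# A protein row 'M-KL' and its coding sequence, including the stop codon TAA.
codon_row = thread_codons("M-KL", "ATGAAACTGTAA")
print(codon_row)  # ATG---AAACTG
```

Because gaps are introduced only in multiples of three nucleotides, the reading frame is preserved, which is what makes codon-level alignments suitable for selection analyses.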

GC content

For each genome, the GC content is computed from the set of ORFs using an in-house Python script.
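The in-house script is not reproduced in the text; a minimal sketch of such a computation might look as follows. The function name `gc_content` is hypothetical, and the decision to ignore ambiguous bases such as N is an assumption.

```python
def gc_content(orfs):
    """Return the GC percentage over all ORF sequences of one genome.

    Only unambiguous bases (A, C, G, T) are counted; ignoring other
    characters such as N is an assumption of this sketch.
    """
    gc = 0
    total = 0
    for seq in orfs:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return 100.0 * gc / total if total else 0.0

# Two toy ORFs: 6 of the 8 bases are G or C.
print(gc_content(["ATGC", "GGCC"]))  # 75.0
```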

Output

The following results are provided:

(i) a text file with ORF counts per genome and its graphical representation as a violin plot;

(ii) a curated file listing the orthologous sets and a histogram of the distribution of set sizes;

(iii) the unaligned sequences, the multiple sequence alignment at the protein level and the multiple sequence alignment at the codon level for each orthologous set. Both protein alignments and codon alignments are often used in downstream analyses, e.g., to find protein motifs (27) and to search for positive Darwinian selection (6), respectively. The unaligned sequences are also available if the user wishes, for example, to realign the sequences using another alignment program;

(iv) a table in which each row is an orthologous group and each column is the set of genes of a specific sample (genome). The i,j entry contains the name of the gene from the jth sample that belongs to the ith group, if such a gene exists (this is especially useful if the input includes at least one annotated genome). In addition, we provide a Fasta file with the phyletic pattern of all orthologous groups. Each record corresponds to a sample and consists of a string of ‘1’s and ‘0’s: the ith character is ‘1’ if the sample has a member gene in the ith orthologous group and ‘0’ otherwise (28). The generated phyletic pattern data (together with the species tree) can be further analyzed by the GLOOME web server (4), which allows inference of gene gain and loss rates, and ancestral reconstruction of these events along the species tree. In addition, we specifically provide a file with the list of ORFs shared by all samples, i.e., the orthologous groups comprising the core proteome, and the concatenated protein alignment of this core proteome in Fasta format. The web server also provides means to extract the proteome shared by x% of the analyzed strains (where x = 100 is the default core proteome);

(v) the phylogenetic species tree representing the evolutionary relationships between all samples, both as a text file in Newick format and using an online interactive visualizer;

(vi) a text file with the GC content of each genome and a graphical representation using a violin plot.
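To make the phyletic pattern and the x% threshold concrete, the sketch below derives both from a small orthologous-group table. The data layout (one dictionary per orthologous group, mapping a genome to its gene name) and the function names are hypothetical, and reading "shared by x%" as "present in at least x% of the genomes" is an assumption.

```python
def phyletic_patterns(og_table, genomes):
    """Return one presence/absence string per genome.

    og_table -- list of dicts, one per orthologous group, mapping
                genome -> gene name (a genome is absent if it has no member)
    """
    return {
        genome: "".join("1" if genome in group else "0" for group in og_table)
        for genome in genomes
    }

def groups_shared_by(og_table, genomes, x=100):
    """Return the indices of groups present in at least x% of the genomes."""
    return [
        index for index, group in enumerate(og_table)
        # Integer comparison avoids floating-point rounding at the threshold.
        if sum(genome in group for genome in genomes) * 100 >= x * len(genomes)
    ]

genomes = ["A", "B", "C"]
og_table = [
    {"A": "a1", "B": "b1", "C": "c1"},  # present in all three genomes
    {"A": "a2", "C": "c2"},             # missing from genome B
    {"B": "b3"},                        # found only in genome B
]
patterns = phyletic_patterns(og_table, genomes)
core = groups_shared_by(og_table, genomes)           # x = 100: core groups
relaxed = groups_shared_by(og_table, genomes, x=60)  # at least 60% of genomes
print(patterns, core, relaxed)
```

Here genome A receives the pattern 110, genome B receives 101 and genome C receives 110. Only the first group is part of the core at x = 100, whereas lowering x to 60 also admits the second group.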

Implementation

M1CR0B1AL1Z3R is implemented in Python 3.6. The source code is available at: https://github.com/orenavram/microbializer. The web server jobs are processed on ProLiant XL170r Gen9 servers, equipped with 128 GB RAM and 28 CPU cores per node. The Gallery, Overview, and Frequently Asked Questions (FAQ) sections of the web server should help users get the most out of the web server. A running example (different from the case studies analyzed in the Gallery) is also provided.

CASE STUDIES

The various analyses and outputs of M1CR0B1AL1Z3R are demonstrated using three datasets: (i) a set of 50 pathogenic Escherichia coli lineage ST131 genomes (29). This dataset represents highly similar clinical isolates of a specific bacterial species. We added an outgroup sequence to this dataset, the genomic sequence of Escherichia fergusonii; (ii) a collection of 73 different Escherichia genomes (72 of which are E. coli and one E. fergusonii). The 72 genomes are all fully sequenced E. coli genomes available as of December 2018 in the NCBI repository, and the E. fergusonii genome is used as an outgroup; (iii) a collection of 29 different Gammaproteobacteria genomes, taken from Pérez et al. (30). Together, these datasets demonstrate the applicability of M1CR0B1AL1Z3R to a range of phylogenetic diversity, from different isolates to different species belonging to different bacterial orders. The complete results for these three examples are available in the Gallery section of the web server. For example, for dataset (ii), the number of ORFs varies from 3,621 to 5,592, with the smallest genome being 3,976,195 bp and the largest 5,697,240 bp. The entire set comprises 8,811 orthologous groups, 1,863 of which constitute the core genome. The multiple sequence alignment of the core proteome (618,921 amino acid sites) was used to reconstruct the maximum-likelihood phylogenetic tree, which is consistent with the previously established E. coli phylogeny (31). The GC content of the analyzed genomes varies from 50.9 to 52.3%. The graphical outputs describing the ORF counts, orthologous group size dispersion, GC-content variation and phylogenetic relationships are shown in Figure 2.

Figure 2.

Selected visual outputs of the M1CR0B1AL1Z3R web server. Top panel (left to right): distribution of the number of ORFs in each genome; distribution of %GC in each genome; distribution of the sizes of the various orthologous groups. Bottom panel: phylogenetic tree representing the evolutionary relationships among all samples. The maximum-likelihood-based tree was reconstructed according to the core proteome as inferred from the orthologous group data.

31 in total

1. A phylogenomic approach to bacterial phylogeny: evidence of a core of genes sharing a common history.

Authors: Vincent Daubin; Manolo Gouy; Guy Perrière
Journal: Genome Res Date: 2002-07 Impact factor: 9.043

2. RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences.

Authors: Rasmus Wernersson; Anders Gorm Pedersen
Journal: Nucleic Acids Res Date: 2003-07-01 Impact factor: 16.971

Review 3. The microbial pan-genome.

Authors: Duccio Medini; Claudio Donati; Hervé Tettelin; Vega Masignani; Rino Rappuoli
Journal: Curr Opin Genet Dev Date: 2005-09-26 Impact factor: 5.578

4. HyPhy: hypothesis testing using phylogenies.

Authors: Sergei L Kosakovsky Pond; Simon D W Frost; Spencer V Muse
Journal: Bioinformatics Date: 2004-10-27 Impact factor: 6.937

5. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution.

Authors: Tal Dagan; William Martin
Journal: Proc Natl Acad Sci U S A Date: 2007-01-09 Impact factor: 11.205

6. MEGAN analysis of metagenomic data.

Authors: Daniel H Huson; Alexander F Auch; Ji Qi; Stephan C Schuster
Journal: Genome Res Date: 2007-01-25 Impact factor: 9.043

7. PAML 4: phylogenetic analysis by maximum likelihood.

Authors: Ziheng Yang
Journal: Mol Biol Evol Date: 2007-05-04 Impact factor: 16.240

8. OrthoMCL: identification of ortholog groups for eukaryotic genomes.

Authors: Li Li; Christian J Stoeckert; David S Roos
Journal: Genome Res Date: 2003-09 Impact factor: 9.043

9. Tractor_DB (version 2.0): a database of regulatory interactions in gamma-proteobacterial genomes.

Authors: Abel González Pérez; Vladimir Espinosa Angarica; Ana Tereza R Vasconcelos; Julio Collado-Vides
Journal: Nucleic Acids Res Date: 2006-11-06 Impact factor: 16.971

10. Selecton 2007: advanced models for detecting positive and purifying selection using a Bayesian inference approach.

Authors: Adi Stern; Adi Doron-Faigenboim; Elana Erez; Eric Martz; Eran Bacharach; Tal Pupko
Journal: Nucleic Acids Res Date: 2007-06-22 Impact factor: 16.971
