Sonia Agrawal, Cesar Arze, Ricky S Adkins, Jonathan Crabtree, David Riley, Mahesh Vangala, Kevin Galens, Claire M Fraser, Hervé Tettelin, Owen White, Samuel V Angiuoli, Anup Mahurkar, W Florian Fricke.
Abstract
BACKGROUND: The benefit of increasing genomic sequence data to the scientific community depends on easy-to-use, scalable bioinformatics support. CloVR-Comparative combines commonly used bioinformatics tools into an intuitive, automated, and cloud-enabled analysis pipeline for comparative microbial genomics.
Keywords: Automated analysis; Bioinformatics resource; Cloud computing; Comparative genomics; Microbial genomics; Virtual machine; Whole-genome alignment
Year: 2017 PMID: 28449639 PMCID: PMC5408420 DOI: 10.1186/s12864-017-3717-3
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Fig. 1 CloVR-Comparative configuration screen. Three options are available to the user to identify and select annotated genome sequence data as input for CloVR-Comparative: (a) using uploaded GenBank files or GenBank files generated by the CloVR-Microbe protocol, both of which can be specified by so-called “tags” as described in the CloVR documentation [1]; (b) through drag-and-drop in the searchable interactive interface that lists genomes available from RefSeq in a taxonomic tree format; and (c) by specifying a list of comma-separated GenBank accession numbers
Fig. 2 Overview and flowchart of CloVR-Comparative. Input data in the form of annotated genomes in GenBank format are first validated and converted into other file formats, then used in whole-genome alignment (WGA) with Mugsy and alignment of translated CDS with MUSCLE to determine COGs. WGAs are used to identify SNPs and to predict phylogenetic relationships based on core genomic regions with Phylomark. From the results, individual circular plots are generated for each input genome. The analysis output is loaded into a Sybil database to provide searches of comparative genome data in a web browser, summary and detailed results reports, and tree and circular figures
Fig. 3 Example of a circular figure output. The figure was generated with Circleator using the example test dataset from the project website as input. It uses Neisseria meningitidis alpha14 (GenBank accession number: NC_013016) as a reference and depicts from outside to inside (1) complete genome (contigs of draft assemblies would be sorted by size); (2, 3) CDSs on forward and reverse strands; (4) core CDSs, defined as COGs that are shared between all input genomes; (5) unique CDSs that are only present in the reference genome (i.e. S. Typhimurium LT2 in this case); (6) unique SNPs, defined as being part of the core genome shared between all input genomes but containing a nucleotide in the reference genome that is different from all other input genomes; (7) G + C content in percent with maximum value shown as gray dotted line, calculated using non-overlapping windows of 5 kbp length; and (8) GC skew, with maximum value shown as gray dotted line, calculated as (G - C) / (G + C) where G and C are nucleotide counts over non-overlapping windows of 5 kbp length
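The G + C content and GC skew tracks described in the Fig. 3 caption can be sketched in a few lines of Python. This is a minimal illustration of the stated formulas over non-overlapping windows, not the Circleator implementation; the function name `gc_windows` and the handling of windows without any G or C (skew reported as 0.0) are assumptions made here.

```python
def gc_windows(seq, window=5000):
    """Yield (start, gc_percent, gc_skew) for non-overlapping windows.

    gc_percent = 100 * (G + C) / window length
    gc_skew    = (G - C) / (G + C), as given in the Fig. 3 caption.
    The last window may be shorter than `window`.
    """
    seq = seq.upper()
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g = chunk.count("G")
        c = chunk.count("C")
        gc_percent = 100.0 * (g + c) / len(chunk)
        # Assumption: a window without any G or C gets a skew of 0.0.
        gc_skew = (g - c) / (g + c) if (g + c) else 0.0
        yield start, gc_percent, gc_skew


# Toy sequence with a 4 bp window instead of the 5 kbp used in the figure.
toy_result = list(gc_windows("GGGCAAAA", window=4))
print(toy_result)  # [(0, 100.0, 0.5), (4, 0.0, 0.0)]
```

For a real genome the default `window=5000` matches the 5 kbp window length named in the caption.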
Fig. 4 Whole-genome alignment-based phylogenetic tree of 40 E. coli genomes from different phylogroups. Reference genomes from eight E. coli strains (see [9] for GenBank accession numbers) were used as input for CloVR-Comparative. Colored boxes and phylogroup assignments were manually added to the automatically generated tree in Newick format
Fig. 5 Screenshot of the phsABCD gene cluster comparison between different S. enterica serotypes. The screenshot from the Sybil comparative analysis tool highlights the phs operon, which encodes the enzymes for the anaerobic production of hydrogen sulfide from thiosulfate, used in anaerobic respiration. The comparison shows that of the four genes that are present in S. Typhimurium LT2, two (phsA and phsD) are missing from the two S. Paratyphi A strains AKU 12601 and ATCC 9150 and one (phsD) is missing from the two S. Typhi strains CT18 and Ty2. The corresponding genomic regions were manually checked and confirmed to contain interrupted open reading frames in those genomes without gene calls. Gene designations in red were manually added to the screenshot, which was copied directly from the Sybil browser
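The "core" and "unique" CDS definitions from the Fig. 3 caption can be applied to the phs gene calls reported in the Fig. 5 caption using plain set operations. The sketch below is only an illustration: it is restricted to the five genomes named in that caption (any other genomes in the comparison are omitted), and the function name `core_and_unique` is hypothetical, not part of CloVR-Comparative or Sybil.

```python
def core_and_unique(presence, reference):
    """Return (core, unique) gene sets relative to a reference genome.

    core   = genes called in every genome
    unique = genes called only in the reference genome
    """
    others = [genes for name, genes in presence.items() if name != reference]
    core = set(presence[reference]).intersection(*others)
    unique = set(presence[reference]).difference(*others)
    return core, unique


# Gene calls as described in the Fig. 5 caption (subset of genomes only).
phs = {
    "S. Typhimurium LT2": {"phsA", "phsB", "phsC", "phsD"},
    "S. Paratyphi A AKU 12601": {"phsB", "phsC"},
    "S. Paratyphi A ATCC 9150": {"phsB", "phsC"},
    "S. Typhi CT18": {"phsA", "phsB", "phsC"},
    "S. Typhi Ty2": {"phsA", "phsB", "phsC"},
}

core, unique = core_and_unique(phs, "S. Typhimurium LT2")
print(sorted(core))    # ['phsB', 'phsC']
print(sorted(unique))  # ['phsD']
```

Within this subset, phsA is neither core nor unique because it is called in the S. Typhi strains but not in the S. Paratyphi A strains.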
Fig. 6 Screenshot of the torSTRCAD gene cluster comparison between different S. enterica serotypes. The Sybil screenshot highlights the tor gene cluster, which is responsible for the reduction of trimethylamine oxide (TMAO) to trimethylamine, used in anaerobic respiration. The comparison shows that of the six genes that are present in S. Typhimurium LT2, at least one, and in several cases two, are missing from S. Choleraesuis SC B67 (torT), S. Gallinarum 287/91 (torS), S. Paratyphi A RKS4594 (torTR), S. Typhi CT18 (torRC) and Ty2 (torR). The corresponding genomic regions were manually checked and confirmed to contain interrupted open reading frames in those genomes without gene calls. Gene designations in red were manually added to the screenshot, which was copied directly from the Sybil browser
Input test datasets for CloVR-Comparative
| Species | Genomes | Size ± SD [Mbp] | Core genome [Kbp] | COGs |
|---|---|---|---|---|
| | 5 | 2.19 | 1775.09 | 1534 |
| | 5 | 3.46 | 2838.89 | 2443 |
| | 12 | 4.58 | 4076.66 | 2793 |
| | 25 | 1.63 | 1294.81 | 1081 |
| | 34 | 2.13 | 1312.15 | 742 |
| | 28 | 4.78 | 3371.09 | 2893 |
| E. coli | 40 | 5.05 | 2744.19 | 1708 |
Resources, runtimes and costs
| Dataset | Amazon EC2 instance type | Amazon EC2 runtime [hours] | Amazon EC2 cost | Local desktop available resources | Local desktop runtime [hours] |
|---|---|---|---|---|---|
| | c1.xlarge | 3.45 | $1.79 | 2 CPUs, 8 GB RAM | 3.15 |
| | c1.xlarge | 3.65 | $1.90 | 2 CPUs, 8 GB RAM | 3.67 |
| | c1.xlarge | 6.97 | $3.62 | 2 CPUs, 8 GB RAM | 6.70 |
| | c1.xlarge | 17.08 | $8.88 | 2 CPUs, 8 GB RAM | 17.72 |
| | c1.xlarge | 16.87 | $8.77 | 2 CPUs, 8 GB RAM | 15.98 |
| | c1.xlarge | 26.10 | $13.57 | 2 CPUs, 8 GB RAM | 24.42 |
| E. coli | m1.xlarge | 56.83 | $19.89 | 4 CPUs, 16 GB RAM (a) | 35.07 |
Amazon EC2 instance types
c1.xlarge, previous generation: 8 virtual CPUs, 7 GB RAM, at $0.520 per hour
m1.xlarge, previous generation: 4 virtual CPUs, 15 GB RAM, at $0.350 per hour
(a) The local setting for the E. coli run was simulated on a server running VMware ESX, using a virtual instance with the listed CPU and memory allocations
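The EC2 costs in the "Resources, runtimes and costs" table are consistent with multiplying the runtime by the hourly rate of the instance type. The sketch below reproduces that arithmetic under the assumption of simple fractional-hour pricing; the name `ec2_cost` is hypothetical, and the rates are the ones listed in the table notes.

```python
# Hourly rates in USD, as listed in the table notes.
HOURLY_RATE_USD = {"c1.xlarge": 0.520, "m1.xlarge": 0.350}


def ec2_cost(instance_type, runtime_hours):
    """Estimated cost in USD: hourly rate times runtime, rounded to cents."""
    return round(HOURLY_RATE_USD[instance_type] * runtime_hours, 2)


# (instance type, runtime in hours) pairs taken from the table rows.
runs = [
    ("c1.xlarge", 3.45),
    ("c1.xlarge", 3.65),
    ("c1.xlarge", 6.97),
    ("c1.xlarge", 17.08),
    ("c1.xlarge", 16.87),
    ("c1.xlarge", 26.10),
    ("m1.xlarge", 56.83),
]

costs = [ec2_cost(instance, hours) for instance, hours in runs]
print(costs)  # [1.79, 1.9, 3.62, 8.88, 8.77, 13.57, 19.89]
```

The computed values match the Cost column, which suggests the reported costs were derived from runtime alone and exclude any storage or data transfer charges.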