Oliver Schwengers, Andreas Hoek, Moritz Fritzenwanker, Linda Falgenhauer, Torsten Hain, Trinad Chakraborty, Alexander Goesmann.
Abstract
Whole genome sequencing of bacteria has become daily routine in many fields. Advances in DNA sequencing technologies and continuously dropping costs have resulted in a tremendous increase in the amounts of available sequence data. However, comprehensive in-depth analysis of the resulting data remains an arduous and time-consuming task. In order to keep pace with these promising but challenging developments and to transform raw data into valuable information, standardized analyses and scalable software tools are needed. Here, we introduce ASA3P, a fully automatic, locally executable and scalable assembly, annotation and analysis pipeline for bacterial genomes. The pipeline automatically executes necessary data processing steps, i.e. quality clipping and assembly of raw sequencing reads, scaffolding of contigs and annotation of the resulting genome sequences. Furthermore, ASA3P conducts comprehensive genome characterizations and analyses, e.g. taxonomic classification, detection of antibiotic resistance genes and identification of virulence factors. All results are presented via an HTML5 user interface providing aggregated information, interactive visualizations and access to intermediate results in standard bioinformatics file formats. We distribute ASA3P in two versions: a locally executable Docker container for small-to-medium-scale projects and an OpenStack-based cloud computing version able to automatically create and manage self-scaling compute clusters. Thus, automatic and standardized analysis of hundreds of bacterial genomes becomes feasible within hours. The software and further information are available at: asap.computational.bio.
Year: 2020 PMID: 32134915 PMCID: PMC7077848 DOI: 10.1371/journal.pcbi.1007134
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Fig 2. Selection of interactive GUI widgets embedded in generated HTML5 reports.
(A) Circular genome plot for a Listeria monocytogenes pseudogenome. The zoomable and scalable SVG-based circular genome plot provides comprehensive information on genome features on mouseover events. Reference-guided rearranged contigs are linked to pseudogenomes for the sake of better readability. From the outermost inward: genes on the forward and reverse strand, respectively, GC content and GC skew. (B) Donut chart of MLST sequence type (ST) distribution. The MLST ST distribution of all isolates analyzed within a project is shown by an interactive donut chart. Single STs can be selected or deselected. (C) Visual representation of normalized assembly key statistics. Per-isolate assembly key statistics are normalized column-wise to the minimum and maximum values within a project and visualized within an interactive data table allowing for column-based sorting and filtering for the rapid comparison of isolates and detection of outliers. (D) Antibiotic resistance profile overview widget. An antibiotic resistance profile comprising 34 distinct target drug classes is computed based on CARD annotations for each isolate and transformed into an overview widget allowing a rapid resistome comparison of all analyzed isolates. Black rectangle: a mouseover-triggered tooltip describing detected antibiotic target drug resistance. (E) SNP-based approximately-maximum-likelihood phylogenetic tree. An approximately-maximum-likelihood phylogenetic tree is computed based on SNPs detected via read-mapping against a reference genome and stored in standard Newick file format. The resulting tree is visualized via the interactive Phylocanvas JavaScript library providing comprehensive user interaction features, e.g. collapsing, expanding and rotating subtrees and tree type selection. (F) Parallel coordinates plot providing a multi-dimensional cohort overview of per-isolate genome metrics and characteristics.
A selection of seven genome key metrics and characteristics is visualized in a parallel coordinates plot providing a multi-dimensional cohort overview enabling the rapid detection of clustered isolates and outliers. Vertical bars: key metrics or characteristics as plot dimensions; coloured horizontal lines: isolates and related values providing table-synchronized highlighting upon mouseovers.
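The column-wise min-max normalization described for panel (C) can be illustrated with a minimal Python sketch. The function name, the column names and the three isolates' values are hypothetical and chosen only for illustration; this is not taken from the ASA3P source code.

```python
from typing import Dict, List


def min_max_normalize(rows: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Scale every column to [0, 1] using that column's minimum and maximum."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    lowest = {c: min(r[c] for r in rows) for c in columns}
    highest = {c: max(r[c] for r in rows) for c in columns}
    normalized = []
    for r in rows:
        normalized.append({
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            c: 0.0 if highest[c] == lowest[c]
            else (r[c] - lowest[c]) / (highest[c] - lowest[c])
            for c in columns
        })
    return normalized


# Hypothetical per-isolate assembly statistics (illustrative values only).
isolates = [
    {"contigs": 10, "n50": 1_500_000},
    {"contigs": 55, "n50": 400_000},
    {"contigs": 100, "n50": 50_000},
]
scaled = min_max_normalize(isolates)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so columns with very different units (contig counts versus N50 in bp) can share one visual scale and outliers stand out at either end.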
Wall clock runtimes for each ASA3P version utilizing different hardware infrastructures and benchmark dataset sizes.
Provided are best-of-three wall clock runtimes for complete ASA3P executions analyzing Listeria monocytogenes benchmark datasets comprising 32 and 1,024 isolates, given in hh:mm:ss format. Docker: a single virtual machine with 32 vCPUs and 64 GB memory was used; analysis of the 1,024 isolate dataset was not feasible due to memory limitations. HPC: ASA3P automatically distributed the workload to an SGE-based high-performance computing cluster comprising 20 nodes providing 40 cores and 256 GB memory each. Cloud: ASA3P was executed in an OpenStack-based cloud computing project comprising 560 vCPUs and 1,280 GB memory in total. Runtimes in parentheses exclude build times for automatic infrastructure setups, i.e. they are the pure ASA3P wall clock runtimes.
| Isolates | Docker | Cloud | HPC |
|---|---|---|---|
| 32 | 10:59:34 | 5:02:24 | 4:49:24 |
| 1,024 | - | 34:47:45 | 27:56:37 |
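To compare these runtimes across dataset sizes, the hh:mm:ss values can be converted to seconds and divided by the number of isolates. The following Python sketch does this for the values listed above; the column labels are taken in the order printed in the table header, and the helper names are hypothetical.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'hh:mm:ss' wall clock runtime into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Runtimes copied from the table; column labels follow the header order as printed.
runtimes = {
    32: {"Docker": "10:59:34", "Cloud": "5:02:24", "HPC": "4:49:24"},
    1024: {"Cloud": "34:47:45", "HPC": "27:56:37"},  # Docker run was not feasible
}


def seconds_per_isolate(isolates: int, platform: str) -> float:
    """Average wall clock seconds spent per isolate for one benchmark run."""
    return to_seconds(runtimes[isolates][platform]) / isolates


for size, by_platform in runtimes.items():
    for platform in by_platform:
        print(f"{size:>5} isolates, {platform:<6}: "
              f"{seconds_per_isolate(size, platform):7.1f} s per isolate")
```

Dividing by the isolate count makes the 32 and 1,024 isolate runs directly comparable, which a raw hh:mm:ss value does not.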
Common genome analysis key metrics for processing and characterization steps analyzing a benchmark dataset comprising 32 Listeria monocytogenes isolates.
Minimum and maximum values of selected common genome analysis key metrics resulting from an automatic ASA3P analysis of an exemplary benchmark dataset comprising 32 Listeria monocytogenes isolates. Metrics are given on a per-isolate level for the processing steps (quality control (QC), assembly, scaffolding and annotation) and for the characterization steps (detection of antibiotic resistances and virulence factors).
| Analysis | Metric | Minimum | Maximum |
|---|---|---|---|
| QC | reads | 393,300 | 6,315,924 |
| QC | mean read length | 125.7 nt | 228.5 nt |
| QC | mean Phred score | 34.7 | 37.2 |
| assembly | genome size | 2,817,892 bp | 3,201,054 bp |
| assembly | contigs | 12 | 108 |
| assembly | N50 | 56,125 bp | 1,568,056 bp |
| assembly | GC content | 37% | 38% |
| scaffolding | scaffolds | 1 | 10 |
| scaffolding | contigs | 0 | 42 |
| scaffolding | N50 | 657,549 bp | 3,034,489 bp |
| annotation | coding genes | 2,735 | 3,200 |
| annotation | non-coding genes | 95 | 144 |
| antibiotic resistance | ABR genes | 0 | 2 |
| virulence factors | VF genes | 16 | 35 |
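For readers unfamiliar with two of the assembly metrics listed above, the following Python sketch computes N50 and GC content using their commonly used definitions. It is an illustration only and not necessarily how ASA3P or its underlying tools derive these values; the function names and example inputs are hypothetical.

```python
from typing import Sequence


def n50(lengths: Sequence[int]) -> int:
    """Length L of the contig at which contigs of length >= L,
    summed from longest to shortest, first cover half of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def gc_content(sequence: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


# Hypothetical contig lengths in bp (illustrative values only).
example_lengths = [80, 70, 50, 40, 30, 20]
example_n50 = n50(example_lengths)
example_gc = gc_content("ATGCGCAT")
```

A higher N50 after scaffolding than after assembly, as in the table, reflects that contigs were joined into fewer, longer sequences.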