Samuel T Westreich, Michelle L Treiber, David A Mills, Ian Korf, Danielle G Lemay.
Abstract
BACKGROUND: Complex microbial communities are an area of growing interest in biology. Metatranscriptomics allows researchers to quantify microbial gene expression in an environmental sample via high-throughput sequencing. Metatranscriptomic experiments are computationally intensive because the experiments generate a large volume of sequence data and each sequence must be compared with reference sequences from thousands of organisms.
Keywords: Annotation; Bacteria; Bioinformatics; Cluster; Functions; GALAXY; Metagenomics; Metatranscriptome; Metatranscriptomics; Microbiome; Open access; Pipeline; RNA-seq; SAMSA; Software; Tool
Year: 2018 PMID: 29783945 PMCID: PMC5963165 DOI: 10.1186/s12859-018-2189-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 The SAMSA2 analysis pipeline. Starting sequence reads are merged, cleaned, and filtered to remove ribosomal RNA (rRNA) sequences. At the annotation step, DIAMOND can be used to incorporate any custom database as an annotation reference. Results are condensed and analyzed using custom Python scripts, and saved as standard data tables that can be imported into R to generate figures or for statistical comparison.
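The "condense" step in Fig. 1 can be illustrated with a minimal Python sketch. It is not the SAMSA2 script itself. It assumes the annotation results are in tab-separated BLAST-style tabular form (read ID in the first column, reference ID in the second) and that the first hit listed for a read is its best hit. The function names and the example rows are hypothetical.

```python
from collections import Counter


def count_hits(lines):
    """Count reads per reference ID, keeping only the first hit seen per read.

    Assumption: input is tab-separated, column 1 = read ID, column 2 = reference ID.
    """
    seen_reads = set()
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows
        read_id, reference_id = fields[0], fields[1]
        if read_id in seen_reads:
            continue  # a later hit for a read that is already counted
        seen_reads.add(read_id)
        counts[reference_id] += 1
    return counts


def to_table(counts):
    """Return (percent, count, reference ID) rows, most abundant first."""
    total = sum(counts.values())
    if total == 0:
        return []
    return [(100.0 * n / total, n, ref) for ref, n in counts.most_common()]


# Hypothetical annotation rows, for illustration only.
example_rows = [
    "read1\trefA\t98.0",
    "read1\trefB\t90.0",
    "read2\trefA\t97.5",
    "read3\trefC\t88.1",
]
table = to_table(count_hits(example_rows))
for percent, n, ref in table:
    print("%.1f\t%d\t%s" % (percent, n, ref))
```

A flat table of this kind (percentage, count, annotation) is easy to import into R for plotting or statistical comparison.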
Fig. 2 (a) PCA plot and (b) heatmap generated by SAMSA2 visualization scripts. Comparisons can be based on organism or functional annotation results, or on any other incorporated database. Both plots show how similar whole metatranscriptomes are to each other: greater similarity appears as (a) closer dots in the PCA plot or (b) a darker blue color in the heatmap.
Fig. 3 Example stacked bar plot. SAMSA2’s default stacked bar graph shows both relative (top) and absolute (bottom) transcript counts per genus, with samples grouped according to control or experimental metadata designations.
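The relationship between the two panels of Fig. 3 can be sketched in Python: the relative panel is the absolute per-genus counts rescaled so that each sample sums to 100%. This is an illustration under stated assumptions, not the SAMSA2 plotting code. The function name, sample names, genus names, and counts are all hypothetical.

```python
def relative_abundance(counts_by_sample):
    """Convert absolute per-genus counts into percentages within each sample."""
    result = {}
    for sample, genus_counts in counts_by_sample.items():
        total = float(sum(genus_counts.values()))
        result[sample] = {
            genus: (100.0 * n / total if total else 0.0)
            for genus, n in genus_counts.items()
        }
    return result


# Hypothetical absolute transcript counts, for illustration only.
counts = {
    "control_1": {"Bacteroides": 600, "Bifidobacterium": 300, "Other": 100},
    "experimental_1": {"Bacteroides": 150, "Bifidobacterium": 50, "Other": 50},
}
relative = relative_abundance(counts)
for sample in sorted(relative):
    for genus in sorted(relative[sample]):
        print("%s\t%s\t%.1f%%" % (sample, genus, relative[sample][genus]))
```

Showing both views together matters because two samples can share the same relative profile for a genus while differing several-fold in absolute transcript counts, as the hypothetical Bacteroides values above do.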
Fig. 4 Example SEED Subsystems annotation pie charts at hierarchy level 1. Pie charts or other figures can be generated for every level of the SEED Subsystems hierarchy.
Fig. 5 Benchmarking of SAMSA2 resource use. (a) DIAMOND annotation time increases linearly as more input sequences are added, which allows total annotation time to be estimated. For this test, all files were run with 30 CPUs, each with 2 GB RAM. (b) Annotation speed relative to allocated memory: a higher RAM allocation allows DIAMOND to hold more of the reference database in memory, speeding up annotation until the database is held fully in memory. All files in this test contained 50,000 sequences each.
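Because annotation time grows linearly with the number of input sequences (Fig. 5a), timings from a few small test files can be extrapolated to a full dataset. The Python sketch below shows one way to do this with an ordinary least-squares line. The function name and all timing values are hypothetical and are not taken from the benchmark.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical timings from small test runs (number of reads, seconds).
reads = [10000, 20000, 40000, 80000]
seconds = [130.0, 230.0, 430.0, 830.0]

slope, intercept = fit_line(reads, seconds)

# Extrapolate to a hypothetical full dataset of one million reads.
estimate = slope * 1000000 + intercept
print("Estimated annotation time: %.0f seconds" % estimate)
```

An estimate like this only holds if the full run uses the same CPU and memory allocation as the test runs, since Fig. 5b shows that annotation speed also depends on allocated RAM.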
Runtime of SAMSA2 components vs SAMSA (Days-Hours:Minutes:Seconds)
| | SAMSA2: Pre-processing | SAMSA2: RefSeq | SAMSA2: SEED Subsystems | SAMSA2: CAZy | SAMSA2: Total runtime | SAMSA 1.0: Total runtime^a |
|---|---|---|---|---|---|---|
| Average | 0-0:21:10 | 0-5:24:18 | 0-0:26:16 | 0-0:02:49 | 0-6:14:21 | 1-19:44:00 |
^a Best-case scenario: data made public immediately after completion (highest priority)
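To compare the two total runtimes directly, the Days-Hours:Minutes:Seconds values can be converted to seconds. The Python sketch below is a minimal illustration; the function name `parse_duration` is hypothetical, and the parser accepts either a hyphen or an en dash between days and hours.

```python
import re


def parse_duration(text):
    """Convert a 'D-H:MM:SS' string (hyphen or en dash after days) to seconds."""
    match = re.match(r"^(\d+)[-–](\d+):(\d{2}):(\d{2})$", text.strip())
    if match is None:
        raise ValueError("Not a D-H:MM:SS duration: %r" % text)
    days, hours, minutes, seconds = (int(group) for group in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# Average total runtimes from the table above.
samsa2_total = parse_duration("0-6:14:21")
samsa1_total = parse_duration("1-19:44:00")

speedup = samsa1_total / float(samsa2_total)
print("SAMSA2: %d s, SAMSA 1.0: %d s, ratio: %.1f" % (samsa2_total, samsa1_total, speedup))
```

On these averages, the SAMSA 1.0 best-case total is roughly seven times the SAMSA2 total.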
Summary of differences between SAMSA versions
| Advantage | SAMSA2 | SAMSA 1.0 |
|---|---|---|
| Runtime^a | Weeks | Months |
| Accuracy | 95% (genus level) | 91% (genus level) |
| Custom database options | Yes | No |
| Version control for reproducibility | Yes | No |
| Data slicing | Organism, group, function, taxonomy level, functional category | Organism, function |
| Figures and graphs | PCA, pie chart, bar plot, diversity | Bar plot |
^a Runtime depends on the number of reads. SAMSA 1.0 runtimes are longer if MG-RAST is not given permission to make the data public immediately
Project home page:
The repository provides the necessary pipeline scripts for SAMSA2, example data files, and example workflows to demonstrate their use. Links are also provided for installation of underlying programs and requirements as listed below.
Operating system(s): any supporting Python 2.7 (tested on Linux)
Programming language(s): Python 2.7, R 3.0
Programs: DIAMOND=0.8.38, Trimmomatic=0.36, PEAR=0.9.8, SortMeRNA=2.1
R packages: DESeq2 [1.12.3], pheatmap [1.0.8], ggplot2 [2.1.0], RColorBrewer [1.1.2], reshape2 [1.4.1], data.table [1.9.6], knitr [1.13], vegan [2.4.0]
License: The GNU General Public License, version 3
The datasets analyzed during the current study are available in the SAMSA2 repository on GitHub.