Hatice Akarsu1,2, Lisandra Aguilar-Bultet2,3,4,5, Laurent Falquet6,7.
Abstract
BACKGROUND: Comparative genomics has seen the development of many software tools that perform clustering, polymorphism and gene content analysis of genomes at different phylogenetic levels (isolates, species). These tools rely on de novo assembly and/or multiple alignments, which can be computationally intensive for large datasets. With a large number of similar genomes in particular, e.g., in surveillance and outbreak detection, assembling each genome can become a redundant and expensive step in the identification of genes potentially involved in a given clinical feature.
Keywords: Comparative genomics; Differential gene presence/absence; RPKM
Year: 2019 PMID: 31791245 PMCID: PMC6889214 DOI: 10.1186/s12859-019-3234-2
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Overview of a deltaRpkm workflow. Black arrows indicate the main pipeline; dotted arrows show an alternative route with STAR. The package is written in R and takes as input a canonical coverage table, plus the design information given by the user as a metadata table. The strength of deltaRpkm lies in bypassing the tedious assembly and annotation steps typical of comparative genomics. Instead, deltaRpkm uses a basic table of gene read counts (based on the mapping against a reference genome) to compute inter-group differential RPKM values per gene, and outputs a list of candidate genes that are present in the samples of the reference genome group and absent from the comparison group.
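The conversion from a read counts table to RPKM values mentioned in the caption can be illustrated with a minimal Python sketch. This is not the package's R code: the function name `rpkm`, the toy counts, and the choice to take each sample's total mapped reads as the column sum of the counts table are assumptions made for illustration. Only the standard RPKM definition (reads per kilobase of gene per million mapped reads) is used.

```python
import numpy as np

def rpkm(counts, gene_lengths_bp):
    """Convert a genes x samples read count table to RPKM.

    RPKM = reads * 1e9 / (gene length in bp * total mapped reads).
    Assumption: total mapped reads per sample is the column sum of `counts`.
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(gene_lengths_bp, dtype=float)
    total_mapped = counts.sum(axis=0)  # one total per sample (column)
    return counts * 1e9 / (lengths[:, None] * total_mapped[None, :])

# Toy table (hypothetical values): 3 genes (rows) x 2 samples (columns)
counts = np.array([[100, 0],
                   [300, 500],
                   [600, 500]])
lengths = np.array([1000, 2000, 3000])  # gene lengths in bp
values = rpkm(counts, lengths)
```

A gene that is absent from a sample receives no mapped reads and therefore an RPKM of zero, which is what makes the inter-group difference informative.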
Fig. 2 Distribution of the median δRPKM values across all genes. For a given dataset analysis and for a given gene, the median value m of all its δRPKM values is plotted (diamonds). The standard deviation s of all the gene median values is then used to set the threshold (2 ∗ s by default) for the significance of differential presence between the two groups of samples. Genes with a median δRPKM value m ≥ 2 ∗ s are considered as differentially present in the reference group. The red dotted line corresponds to 2 ∗ s. The grey dotted line corresponds to the Median Absolute Deviation (MAD). This summary plot can be produced by running deltaRpkm::median_plot. A dataset of size N = 51 from Listeria monocytogenes (genome size ~ 3 Mb for ~ 3 K genes) was used for the analysis represented in the figure, see [1].
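The thresholding rule in this caption can be sketched in Python as follows. This is an illustration under stated assumptions rather than the package's implementation: δRPKM is taken here to be the difference between a gene's RPKM in a reference-group sample and in a comparison-group sample, computed for every inter-group pair, and s is computed as the sample standard deviation (n − 1 denominator). The function names and the toy values are hypothetical.

```python
import numpy as np

def median_delta_rpkm(rpkm_ref, rpkm_cmp):
    """Median per gene of all pairwise inter-group RPKM differences.

    rpkm_ref: genes x n_ref samples (reference group)
    rpkm_cmp: genes x n_cmp samples (comparison group)
    Assumption: delta = RPKM(reference sample) - RPKM(comparison sample).
    """
    ref = np.asarray(rpkm_ref, dtype=float)
    cmp_ = np.asarray(rpkm_cmp, dtype=float)
    deltas = ref[:, :, None] - cmp_[:, None, :]  # genes x n_ref x n_cmp
    return np.median(deltas.reshape(ref.shape[0], -1), axis=1)

def select_genes(medians, k=2.0):
    """Flag genes whose median delta m satisfies m >= k * s."""
    s = np.std(medians, ddof=1)  # assumption: sample standard deviation
    threshold = k * s
    return medians >= threshold, threshold

# Toy RPKM values (hypothetical): 5 genes, 2 samples per group
rpkm_ref = [[100, 120], [50, 50], [10, 12], [30, 28], [200, 180]]
rpkm_cmp = [[0, 0], [48, 52], [11, 9], [29, 31], [190, 200]]

medians = median_delta_rpkm(rpkm_ref, rpkm_cmp)
selected, threshold = select_genes(medians)
```

In this toy example only the first gene, which has no coverage in the comparison group, has a median δRPKM above 2 ∗ s; the remaining genes have medians near zero and are not selected.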
Fig. 3 Heatmap of the RPKM distribution of the selected genes. These genes are considered as differentially present between group 1 (samples that have the same phenotype as the reference genome) and group 2 of samples. A dataset of N = 51 Listeria monocytogenes genomes is represented in this figure.
Main functions for a differential gene presence/absence analysis with deltaRpkm. Functions are listed in the chronological order of usage
| Function name | Description | Output(s) |
|---|---|---|
| loadMetadata() | format the user metadata table | data frame of the design table |
| rpkm() | convert read counts to RPKM | data frame of RPKM values |
| deltarpkm() | compute pairwise | data frame of samples inter-group |
| deltaRPKMStats() | compute 1) median | data frame with genes annotated as differentially present in reference group 1 versus comparison group 2 |
| median_plot() | diagnostics plot to visualize the median | deltaRpkm_medians_plot.pdf file in the working directory |
| rpkmHeatmap() | heatmap of the RPKM values of the selected set of genes | deltaRpkm_heatmap.tiff file in the working directory |
Fig. 4 deltaRpkm on GitHub. Content of the documentation directory for full tutorials.
Runtimes of the deltaRpkm pipeline versus the two most similar tools. Since deltaRpkm does not require any assembly and annotation steps, it is difficult to compare it with other methods.
| Method | Small dataset | Large dataset |
|---|---|---|
| deltaRpkm | N = 225, runtime = ~ 20 min | |
| Roary [ | N = 24, runtime = ~ 6 min ( | ( |
| BPGA [ | ( | N = 1000, runtime = ~ 420 min ( |