Narendrakumar M Chaudhari, Vinod Kumar Gupta, Chitra Dutta.
Abstract
Recent advances in ultra-high-throughput sequencing technology and metagenomics have led to a paradigm shift in microbial genomics, from comparisons of a few genomes to large-scale pan-genome studies at different scales of phylogenetic resolution. Pan-genome studies provide a framework for estimating the genomic diversity of a dataset, determining the core (conserved), accessory (dispensable) and unique (strain-specific) gene pools of a species, tracing horizontal gene flux across strains and providing insight into species evolution. Existing pan-genome software tools suffer from limitations such as restricted dataset sizes, difficult installation or requirements, and inadequate functional features. Here we present BPGA (Bacterial Pan Genome Analysis tool), an ultra-fast computational pipeline with seven functional modules. In addition to routine pan-genome analyses, BPGA introduces a number of novel features for downstream analyses, such as core/pan/MLST (Multi Locus Sequence Typing) phylogeny, exclusive presence/absence of genes in specific strains, subset analysis, atypical G + C content analysis and KEGG and COG mapping of core, accessory and unique genes. Other notable features include minimal running prerequisites, freedom to select the gene clustering method, ultra-fast execution, a user-friendly command line interface and high-quality graphics outputs. The performance of BPGA has been evaluated using a dataset of complete genome sequences of 28 Streptococcus pyogenes strains.
Year: 2016 PMID: 27071527 PMCID: PMC4829868 DOI: 10.1038/srep24373
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Description of BPGA Pipeline.
| Features | Description | Tools/scripts | Notes | Equivalent tools. | Citation |
|---|---|---|---|---|---|
| Preparation step | Preprocessing of raw files (.faa, .fsa or any fasta or .gbk) leading to a single | BPGA script | BPGA modifies the files by inserting genome ID into the sequence headers. | NA | This study |
| Clustering | Clusters genes into orthologous clusters based on sequence similarity. | USEARCH | USEARCH is the fastest clustering tool so far. BPGA uses it as the default clustering tool and can also process clusters from the other two supported tools (CD-HIT and OrthoMCL). | Roary, PGAP, PGAT, ITEP, Panseq. | |
| Matrix Generation (Pan-Matrix) | Generates a binary (1/0) presence/absence matrix from orthologous clusters. | BPGA script | BPGA script checks the presence or absence of genes in the individual strains and writes the result as a matrix. | Roary, PanGP, PGAP. | |
| Pan-Genome Profile Analysis | Calculates shared genes after stepwise addition of each individual genome. This trend can be plotted as Core or Pan-genome Profile Curves. | BPGA script, gnuplot. | BPGA script calculates such trends taking different permutations/combinations of genomes. | Roary, PanGP, PGAP. | |
| Phylogeny Construction | | BPGA script, MUSCLE | BPGA script concatenates the core sequences from all strains and converts the pan-matrix into a Newick tree. MUSCLE is a faster and more accurate alignment and tree generator tool. | Roary, PGAP, Panseq, ITEP. | |
| Function and Pathway | COG and KEGG assignments on the basis of best hits against the respective reference databases. | USEARCH | Best hits are processed to get the % occurrences for all COG and KEGG pathway categories. | | |
| Pan-Genome Statistics | Provides genome-wise core, accessory, unique and exclusively absent gene counts. | BPGA script | Gives an idea of the contribution of each strain to the pan-genome. | None | This study |
| Atypical GC Content Analysis | Identifies genes whose GC content is substantially higher or lower than the genomic GC content. | BPGA script | Applicable only if GenBank files are used as input. | None | This study |
| Subset Analysis | Divides the original dataset into user-defined smaller subsets and performs the default pan-genomic analyses. | BPGA script | The subsets may be based on pathogenic potential, habitat, taxonomical groups or any other criteria. | None | This study |
| Exclusive gene absence | Identifies the clusters showing exclusive absence of a gene from a specific strain. | BPGA script | Sequences of such clusters are given in the output file. | None | This study |
#Automated by BPGA script.
*Supported outputs.
†These are novel features by BPGA, NA-Not Applicable.
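The matrix-generation and pan-genome statistics steps in the table above can be illustrated with a minimal Python sketch. This is not BPGA's own code. It assumes, hypothetically, that each sequence header carries the genome ID as a prefix separated by `|` (the table only states that BPGA inserts the genome ID into the headers, not the exact format), and it interprets "exclusively absent" as a gene family present in every genome except one. The names `build_pan_matrix` and `genome_stats` are illustrative.

```python
from typing import Dict, List


def build_pan_matrix(clusters: Dict[str, List[str]],
                     genome_ids: List[str]) -> Dict[str, List[int]]:
    """Turn orthologous clusters into a 1/0 presence/absence matrix.

    Assumption: each member header looks like "<genome_id>|<protein_id>".
    """
    matrix = {}
    for cluster_id, members in clusters.items():
        # Collect the genomes that contribute at least one member.
        present = {header.split("|", 1)[0] for header in members}
        matrix[cluster_id] = [1 if g in present else 0 for g in genome_ids]
    return matrix


def genome_stats(matrix: Dict[str, List[int]],
                 genome_ids: List[str]) -> Dict[str, Dict[str, int]]:
    """Count core, accessory, unique and exclusively absent families per genome."""
    n = len(genome_ids)
    stats = {g: {"core": 0, "accessory": 0, "unique": 0, "exclusively_absent": 0}
             for g in genome_ids}
    for row in matrix.values():
        total = sum(row)  # number of genomes carrying this family
        for genome, present in zip(genome_ids, row):
            if present:
                if total == n:
                    stats[genome]["core"] += 1
                elif total == 1:
                    stats[genome]["unique"] += 1
                else:
                    stats[genome]["accessory"] += 1
            elif total == n - 1:
                # Every other genome has the family; only this one lacks it.
                stats[genome]["exclusively_absent"] += 1
    return stats


# Toy input: three genomes, three clusters (all IDs are made up).
genome_ids = ["g1", "g2", "g3"]
clusters = {
    "c1": ["g1|p1", "g2|p7", "g3|p2"],  # present everywhere
    "c2": ["g1|p2", "g1|p3"],           # two paralogs, one genome
    "c3": ["g2|p9", "g3|p4"],           # missing only from g1
}
matrix = build_pan_matrix(clusters, genome_ids)
stats = genome_stats(matrix, genome_ids)
print(matrix)
print(stats)
```

Paralogs within one genome collapse to a single 1 in the matrix, which matches the binary presence/absence description in the table.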
Figure 1. BPGA workflow.
Comparison of BPGA with other tools currently available for pan-genome analysis.
| Features | BPGA | Roary | PanGP | PGAP | PGAT | Panseq | ITEP |
|---|---|---|---|---|---|---|---|
| Input file/s | .GBK/.FAA/Matrix | .gff | Cluster/ Matrix | .FAA, .FFN & .PTT | × | Contig file | × |
| | ✓ | × | × | × | × | × | × |
| | ✓ | × | × | × | × | × | × |
| | ✓ | × | × | × | × | × | × |
| | ✓ | × | × | × | × | × | × |
| | ✓ | × | × | × | × | × | × |
| Pan-genome profile analysis | ✓ | ✓ | ✓ | ✓ | × | × | × |
| Size of core and pan-genome | ✓ | ✓ | × | × | × | ✓ | × |
| Extraction of core, accessory and unique genome sequence | ✓ | ✓ | × | × | × | × | ✓ |
| Evolutionary analysis | ✓ | ✓ | × | ✓ | × | ✓ | ✓ |
| Functional distribution analysis (COG) | ✓ | × | × | ✓ | ✓ | × | ✓ |
| Protein/gene clustering | ✓ | ✓ | × | ✓ | ✓ | ✓ | ✓ |
| Input data from user | ✓ | ✓ | ✓ | ✓ | × | ✓ | × |
| Operating System | Windows, Linux and any Perl supported OS. | Linux | Windows & Linux | Linux | NA | Windows & Linux | Linux |
| Mode of program | Standalone | Standalone | Standalone | Standalone | Online Database | Online & Standalone | Standalone |
| Test dataset | NA | ||||||
| No. of genomes in test dataset | 28 | 24 | 30 | 11 | NA | 11 | 24 |
| Time cost | ~3 m | >6 m | 48 sec | 29–200 m | NA | ~10 m | >1 hr |
| Performance | Very Fast | Fast | Fast | Slow | Slow | Slow | Slow |
Note: Features in bold are unique to BPGA.
Figure 2. Overview of the results generated by BPGA using 28 strains of S. pyogenes.
(a) The gene family frequency spectrum. (b) New gene family distribution after sequential addition of each genome to the analysis. (c) The pan-genome profile trends obtained using the clustering tools USEARCH, CD-HIT and OrthoMCL. (d) COG distribution of core, accessory and unique genes. (e) KEGG distribution of core, accessory and unique genes.
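The quantities behind panels (a) and (c) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not BPGA's implementation: the number of permutations, the random seed and the function names `frequency_spectrum` and `pan_core_profile` are hypothetical, and the source does not say how many genome orderings BPGA samples. The sketch averages pan- and core-genome sizes over random genome orderings as genomes are added one at a time.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def frequency_spectrum(pan_matrix: Dict[str, List[int]]) -> Dict[int, int]:
    """Map 'number of genomes carrying a family' to 'number of such families'."""
    return dict(Counter(sum(row) for row in pan_matrix.values()))


def pan_core_profile(pan_matrix: Dict[str, List[int]],
                     n_permutations: int = 20,
                     seed: int = 0) -> Tuple[List[float], List[float]]:
    """Average pan and core sizes after adding 1..N genomes in random orders."""
    rows = list(pan_matrix.values())
    n_genomes = len(rows[0])
    rng = random.Random(seed)
    pan_sums = [0] * n_genomes
    core_sums = [0] * n_genomes
    for _ in range(n_permutations):
        order = list(range(n_genomes))
        rng.shuffle(order)
        for k in range(1, n_genomes + 1):
            chosen = order[:k]
            # Pan: families seen in at least one chosen genome.
            pan_sums[k - 1] += sum(1 for r in rows if any(r[g] for g in chosen))
            # Core: families seen in every chosen genome.
            core_sums[k - 1] += sum(1 for r in rows if all(r[g] for g in chosen))
    pan = [s / n_permutations for s in pan_sums]
    core = [s / n_permutations for s in core_sums]
    return pan, core


# Toy presence/absence matrix: four gene families across four genomes.
toy_matrix = {
    "fam1": [1, 1, 1, 1],
    "fam2": [1, 1, 0, 0],
    "fam3": [0, 0, 1, 0],
    "fam4": [1, 1, 1, 0],
}
spectrum = frequency_spectrum(toy_matrix)
pan_curve, core_curve = pan_core_profile(toy_matrix)
print(spectrum)
print(pan_curve, core_curve)
```

Averaging over orderings removes the dependence on which genome happens to be added first. The pan curve can only grow and the core curve can only shrink as genomes are added.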
Figure 3. Phylogenetic analysis by BPGA using 28 strains of S. pyogenes based on (a) concatenated core genes, (b) concatenated housekeeping genes (MLST) and (c) the binary pan-matrix. (Blue: Group M1 strains; Red: Group M12 strains.)
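A pan-matrix phylogeny such as panel (c) needs some measure of how different two strains are in gene content. The source does not state which distance or tree-building method BPGA uses. As an illustrative stand-in, the sketch below computes pairwise Jaccard distances from a binary pan-matrix; the resulting matrix could then be passed to any distance-based tree method. The function name `jaccard_distances` and the toy data are hypothetical.

```python
from typing import Dict, List


def jaccard_distances(pan_matrix: Dict[str, List[int]],
                      n_genomes: int) -> List[List[float]]:
    """Pairwise Jaccard distance: 1 - shared families / families in either genome."""
    rows = list(pan_matrix.values())
    dist = [[0.0] * n_genomes for _ in range(n_genomes)]
    for i in range(n_genomes):
        for j in range(i + 1, n_genomes):
            shared = sum(1 for r in rows if r[i] and r[j])
            either = sum(1 for r in rows if r[i] or r[j])
            # Two genomes with no families at all are treated as identical.
            d = 1.0 - shared / either if either else 0.0
            dist[i][j] = dist[j][i] = d
    return dist


# Toy presence/absence matrix: four gene families across four genomes.
toy_pan_matrix = {
    "fam1": [1, 1, 1, 1],
    "fam2": [1, 1, 0, 0],
    "fam3": [0, 0, 1, 0],
    "fam4": [1, 1, 1, 0],
}
dist = jaccard_distances(toy_pan_matrix, 4)
for row in dist:
    print([round(x, 3) for x in row])
```

In this toy example the first two genomes carry identical gene families, so their distance is zero and they would group together in any tree built from the matrix.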