Ali Al-Shahib, Anthony Underwood.
Abstract
BACKGROUND: A typical bacterial pathogen genome mapping project can identify thousands of single nucleotide polymorphisms (SNPs). Interpreting SNP data is complex, and it is difficult to conceptualise the data contained within the large flat files that are the typical output of most SNP calling algorithms. One solution to this problem is to construct a database that can be queried using simple commands so that SNP interrogation and output are both easy and comprehensible.
Year: 2013 PMID: 24246037 PMCID: PMC3840589 DOI: 10.1186/1471-2105-14-326
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
ST315 clade-unique SNP information: typical snp-search output for the -u option
| 16387 | A | G | syn | putative amino acid permease | No | | | | | |
| 512693 | A | G | non-syn | putative dipeptidase | No | T | A | No | Yes | No |
| 924493 | T | C | non-syn | putative sugar ABC transporter (ATP-binding protein) | No | S | P | No | Yes | No |
| 1108998 | G | T | non-syn | putative N-acetylglucosamine-6-phosphate isomerase | No | A | S | Yes | Yes | No |
| 1186147 | C | T | syn | putative amino acid ABC transporter (ATP-binding protein) | No | | | | | |
| 1573272 | T | C | non-syn | putative sucrose operon repressor | No | L | S | Yes | Yes | Yes |
| 1817279 | C | T | | | | | | | | |
| 1868362 | T | C | | | | | | | | |
The empty cells in the last two rows indicate that the SNP occurred in a non-CDS region, so no further information is available. Syn: synonymous, Pos: position of SNP in reference genome, AA: amino acid, orig: original, CH: change in hydrophobicity, CP: change in polarity, CS: change in size.
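The syn/non-syn column can be illustrated with a minimal sketch that translates a codon before and after a single-base substitution using the standard genetic code. This is not snp-search's own implementation; the function name `classify_snp` and the example codons are hypothetical, chosen only so that the amino acid changes resemble those in the table.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(codon, offset, new_base):
    """Classify a substitution at `offset` (0-2) within `codon`.

    Returns (label, original_aa, new_aa), where label is "syn" when the
    encoded amino acid is unchanged and "non-syn" otherwise.
    """
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    orig_aa = CODON_TABLE[codon]
    new_aa = CODON_TABLE[mutated]
    label = "syn" if orig_aa == new_aa else "non-syn"
    return label, orig_aa, new_aa


# Hypothetical codons: ACT (T) -> GCT (A) changes the amino acid,
# while CTT (L) -> CTC (L) does not.
print(classify_snp("ACT", 0, "G"))
print(classify_snp("CTT", 2, "C"))
```

The hydrophobicity, polarity and size columns would need an amino acid property table, which the text here does not specify, so they are left out of the sketch.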
Figure 1. snp-search database schema.
Figure 2. snp-search machinery. Given a GenBank file and a VCF file, snp-search first creates the SQLite database and populates it with the data. The user can then choose among several output options: a concatenated SNP FASTA file, a Newick-format tree file for phylogenetic analysis, and a list of SNPs (depending on the query) with information for every SNP, such as whether it is synonymous or non-synonymous.
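The workflow in this caption can be sketched with Python's built-in `sqlite3` module. The three-table layout below (`strains`, `snps`, `alleles`) is an assumption made for illustration and is not the schema shown in Figure 1; the strain names and allele calls are invented toy data, and parsing of the GenBank and VCF inputs is omitted.

```python
import sqlite3


def build_db(ref_bases, variants):
    """Create an in-memory SNP database (hypothetical schema).

    ref_bases: {position: reference base}
    variants:  {strain name: {position: variant base}}
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE strains (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE snps (id INTEGER PRIMARY KEY,
                           ref_pos INTEGER UNIQUE, ref_base TEXT);
        CREATE TABLE alleles (snp_id INTEGER REFERENCES snps(id),
                              strain_id INTEGER REFERENCES strains(id),
                              base TEXT,
                              PRIMARY KEY (snp_id, strain_id));
    """)
    conn.executemany("INSERT INTO snps (ref_pos, ref_base) VALUES (?, ?)",
                     sorted(ref_bases.items()))
    for strain, calls in variants.items():
        cur = conn.execute("INSERT INTO strains (name) VALUES (?)", (strain,))
        strain_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO alleles (snp_id, strain_id, base) "
            "SELECT id, ?, ? FROM snps WHERE ref_pos = ?",
            [(strain_id, base, pos) for pos, base in calls.items()],
        )
    return conn


def concatenated_fasta(conn):
    """Concatenate each strain's base at every SNP position, in position
    order, falling back to the reference base where no variant is stored."""
    rows = conn.execute("""
        SELECT st.name, COALESCE(a.base, s.ref_base)
        FROM strains st
        CROSS JOIN snps s
        LEFT JOIN alleles a ON a.snp_id = s.id AND a.strain_id = st.id
        ORDER BY st.name, s.ref_pos
    """)
    seqs = {}
    for name, base in rows:
        seqs[name] = seqs.get(name, "") + base
    return "".join(f">{name}\n{seq}\n" for name, seq in seqs.items())


# Toy data: positions echo the table above; strains and calls are invented.
ref = {16387: "A", 512693: "A", 924493: "T"}
calls = {"strain_A": {16387: "G"}, "strain_B": {512693: "G", 924493: "C"}}
print(concatenated_fasta(build_db(ref, calls)), end="")
```

Keeping the alleles in their own table means a strain only needs rows for the positions where it differs from the reference, which keeps the database small when most positions match.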
Figure 3. Maximum-likelihood (ML) phylogenetic tree of 200 genomes based on the SNP alignment from the core genome. (A) Original maximum-likelihood tree. (B) Tree in which the unusually long branches leading to the strains marked with black circles have been shortened. This tree was produced by querying the SNP database generated by snp-search.
Figure 4. SNP phylogeny of 295 GAS strains. snp-search was used to generate the concatenated FASTA file, and FastTree was used to generate the tree. Strains are separated according to the number of SNP differences between them. snp-search was also used to output the number of unique SNPs for each distinguishable clade, given the names of the samples in each clade.
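The two quantities mentioned in this caption, SNP differences between strains and SNPs unique to a clade, can be sketched with plain Python sets. The definition of "unique" used here is an assumption: an allele counts as unique if every strain in the clade carries it and no strain outside the clade does. The function names and toy calls are hypothetical and do not reproduce snp-search's queries.

```python
def snp_distance(calls, a, b):
    """Number of SNP positions at which strains a and b differ.

    calls: {strain: {position: variant base}}; a missing position is
    taken to mean the strain carries the reference base there.
    """
    positions = set(calls[a]) | set(calls[b])
    return sum(calls[a].get(p) != calls[b].get(p) for p in positions)


def clade_unique_snps(calls, clade):
    """Return sorted (position, base) alleles shared by every strain in
    `clade` and absent from every strain outside it (assumed definition)."""
    clade = set(clade)
    inside = [set(calls[s].items()) for s in clade]
    outside = [set(c.items()) for s, c in calls.items() if s not in clade]
    shared = set.intersection(*inside) if inside else set()
    seen_outside = set().union(*outside)
    return sorted(shared - seen_outside)


# Invented toy calls for three strains.
calls = {
    "s1": {100: "G", 200: "T"},
    "s2": {100: "G", 300: "C"},
    "s3": {200: "T"},
}
print(snp_distance(calls, "s1", "s2"))
print(clade_unique_snps(calls, ["s1", "s2"]))
```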